© Insight 2014. All Rights Reserved
Knowledge Processing with Big Data and
Semantic Web Technologies
Ali Hasnain, Narumol Prangnawarat, Stefan Decker, Naoise Dunne
Presenters and Contributors
Ali Hasnain
Stefan Decker
Narumol Prangnawarat
Naoise Dunne
Agenda
• Motivation
• Infrastructure
• Data Curation
• Query Federation
• Analyze
• Visualization
• Hands On Session
Session 0: Motivation
The Web is evolving...
WWW (Tim Berners-Lee)
“There was a second part of the dream […] we could then use computers to help us analyse it, make sense of what we're doing, where we individually fit in, and how we can better work together.”
A Network of Knowledge
● Interconnected
● Universal
● All encompassing
● Assists humans, organisations and systems with problem solving
● Enables innovation and increased productivity
Two Key Ingredients
1. RDF – Resource Description Framework
Graph based Data – nodes and arcs
• Identifies objects (URIs)
• Interlinks information (relationships)
2. Vocabularies (Ontologies)
• provide shared understanding of a domain
• organise knowledge in a machine-comprehensible way
• give an exploitable meaning to the data
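The idea of graph-based data can be sketched in a few lines of Python. This is a minimal illustration, not an RDF library: triples are plain tuples, the prefixed names are borrowed from the Ireland example on the following slide, and which node each arc connects is an assumption, since the diagram itself is not reproduced here. `match` is a hypothetical helper.

```python
# Triples as (subject, predicate, object) tuples. The wiring of the arcs
# is assumed for illustration only.
TRIPLES = {
    ("EU:RepublicOfIreland", "Geo:hasCapital", "Cities:Dublin"),
    ("EU:RepublicOfIreland", "Geo:hasLargestCity", "Cities:Dublin"),
    ("EU:RepublicOfIreland", "Geo:locatedOn", "Geo:IslandOfIreland"),
    ("Geo:IslandOfIreland", "Geo:area", "84421km2"),
    ("EU:RepublicOfIreland", "Gov:hasTaoiseach", "Person:EndaKenny"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Everything that points at Dublin, regardless of which source said it.
dublin_facts = match(TRIPLES, o="Cities:Dublin")
```

Because every statement has the same three-part shape, facts from different sources can be merged into one set and queried with the same pattern.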
Why Graphs?
[Figure: two RDF graphs about EU:RepublicOfIreland. From Wikipedia.org: nodes Cities:Dublin, Geo:IslandOfIreland and the literal 84421km2, with arcs Geo:locatedOn, Geo:area, Geo:hasCapital and Geo:hasLargestCity. From Gov.ie: nodes Person:EndaKenny and IE:DepartmentOfFinance, with arcs Gov:hasTaoiseach and Gov:hasDepartment.]
Why Graphs?
[Figure: the gene TGFβ-3 (transforming growth factor, beta 3) described in two sources. Gene Database: nci:has_description, nih:sequence (CCCCGGCGCAGCGCGGCCGCA…), nih:organism (Homo sapiens) and nih:location (14q24). Pathway Database: three rea:process arcs to "Platelet activation, signalling, aggregation", "Response to elevated platelet cytosol Ca2+" and "Platelet degranulation".]
Linked Open Data Cloud
Life Sciences...
Cultural Institutions...
Open Government Data...
Legacy Data Sources...
How to analyse these data?
Heterogeneity in
• Data: Different formats
• Domains: How to cross discipline borders?
• Users: Life Science Data needs different analysis and
visualisation than Cultural Data
An analysis tool for each domain?
Reusable Infrastructure: Knowledge Pipeline
• Networked Data Management
• Abstraction, Reasoning, Analytics
• Visualisation, Collaboration, Exploitation
Challenges for a Knowledge Pipeline
• How to ingest data sets
• How to automate and scale processing
• How to realise large scale Linked Data processing
• How to analyse large data sets
• How to visualize large datasets
• How to combine different components
Session 1: Infrastructure
Trends in big linked data infrastructure
Introduction
Cloud computing frameworks tailored for managing and analyzing big data sets are powering ever larger clusters of computers.
This presentation describes the infrastructure required to serve a linked data flavour of big data.
What is Big Data?
“Big Data is characterized by High Volume, Velocity and
Variety requiring specific Technology and Analytical
Methods for its transformation into Value” - Gartner
• Volume - Data is too big to fit on even the largest server
• Velocity - Data arrives quickly and must be handled at that speed
• Variety - Heterogeneous data that comes in many forms
• Veracity - (sometimes) Quality of the data
[Figure: infographic summarizing the 4 Vs]
The qualities of a big data Infrastructure
• Distributed
  • Data and its processing are spread across a large cluster of cheap commodity servers
  • Loose coupling, isolation, location transparency, data locality & app-level composition
• High Utilisation
  • The infrastructure makes the best use of computing resources
• Resilience (handling failure)
  • The infrastructure stays responsive in the face of failure and can “heal”
• Scalability
  • Grows to meet demand (elastic) and stays responsive under varying workload (load balanced)
• Operationally efficient
  • Needs to be highly automated and very easy to maintain
Distributed
High Utilisation & Scalability
The Rise of distributed Datacenter Schedulers
Why Datacenter schedulers?
Schedulers run your distributed apps
• Act as an operating system kernel for the cloud
• Coordinate execution of work on the cluster
• Help you get as many compute resources as you want, whenever you want them
• Abstract away some scalability and load balancing issues
Benefits of using a Scheduler
• Efficiency - best use of computing resources
• Agility - change your application mix with no turnaround
• Scalability - grow to the current demand of your app
• Modularity - two-level schedulers have plugin frameworks that allow quick repurposing of the core and avoid reliance on one vendor (more later)
Datacenter schedulers
Schedulers help you focus on your own work and not the
infrastructure.
“It's great to be able to focus on what it is you want to be doing rather than worrying about how do you get what it is you need in order to be able to get stuff done”
- John Wilkes (Google)
Quick history of distributed schedulers
[Timeline figure: History of Datacenter Schedulers, 2003 to 2015]
• 2003 Google File System
• 2003 Slurm
• 2004 MapReduce paper
• 2004 Google Borg
• 2005 Hadoop started
• 2008 Hadoop released
• 2010 Spark paper
• 2010 Nexus (Mesos)
• 2011 Mesos paper
• 2011 Hadoop 1.0
• 2013 YARN
• 2013 Mesos released
• 2014 Kubernetes
• 2014 Google Omega paper
Hadoop
Monolithic scheduler: the original open-source datacenter scheduler
• Jobs are batched and executed
• Designed only to run MapReduce jobs
• No concurrency between apps
• Evolving into YARN
[Figure: Hadoop resource management running across Linux servers (labelled "mesos slave" in the diagram), with linked data M/R jobs and a linked data app on top]
Mesos
Two-level scheduler: more flexible
• Can schedule many kinds of applications
• Frameworks (such as Spark) are delegated the per-application scheduling
• Mesos is responsible for distributing resources between applications and enforcing overall fairness
• Very modular thanks to two-level scheduling: frameworks manage apps as they like
[Figure: Mesos resource management running across Linux servers, each with a Mesos slave; on top, the Mesos scheduler hosts the Chronos, Spark and Marathon frameworks, which run a Hadoop M/R job, a linked data job and a linked data app]
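The split between the two scheduling levels can be sketched as a toy simulation. This is not the Mesos API: `Offer`, `Framework` and `allocate` are hypothetical names, and the first level here simply offers resources to frameworks in list order, whereas a real cluster manager applies a fairness policy.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    server: str
    cpus: int


@dataclass
class Framework:
    """Second level: decides which offered resources to use for its tasks."""
    name: str
    cpus_per_task: int
    pending_tasks: int
    launched: list = field(default_factory=list)

    def consider(self, offer):
        """Launch as many tasks as fit in the offer; return the CPUs used."""
        fit = min(self.pending_tasks, offer.cpus // self.cpus_per_task)
        self.launched.extend([offer.server] * fit)
        self.pending_tasks -= fit
        return fit * self.cpus_per_task


def allocate(free_cpus, frameworks):
    """First level: offer each server's free CPUs to the frameworks in turn."""
    for server in sorted(free_cpus):
        for fw in frameworks:
            if free_cpus[server] == 0:
                break
            free_cpus[server] -= fw.consider(Offer(server, free_cpus[server]))
    return free_cpus


spark = Framework("spark", cpus_per_task=2, pending_tasks=3)
marathon = Framework("marathon", cpus_per_task=1, pending_tasks=2)
remaining = allocate({"server-a": 4, "server-b": 4}, [spark, marathon])
```

The resource manager never needs to know what a Spark task or a Marathon app is; it only hands out CPUs, and each framework decides what to run on them.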
Schedulers Recap
Schedulers allow you to use your cluster as one machine
• Ease operations
• Provide elasticity and load balancing
• Can run both batch and longer-running jobs
• Are efficient, agile, scalable and modular
Managing Failure in
Infrastructure
“Everything fails all the time”
Werner Vogels (CTO Amazon)
Handling Failure
The harsh reality:
All distributed infrastructures must deal with failure.
Designers of applications running on distributed infrastructure (even Mesos) need to avoid a great number of design mistakes...
Fallacies of distributed computing
The most common misconceptions that lead to failure of network infrastructure:
• The network is reliable.
• Latency is zero.
• Bandwidth is infinite.
• The network is secure.
• Topology doesn't change.
• There is one administrator.
• Transport cost is zero.
• The network is homogeneous.
All prove to be false in the long run, and all cause big trouble and painful learning experiences.
How successful infrastructure avoids failure
No single point of failure
Use multiple masters, multiple copies of data, and redundancy on all services. Use elections between masters, distributed locks and thresholds.
Design applications to expect failure
Applications should continue to function even if the underlying physical hardware fails. Evolving infrastructure eases this through the use of schedulers and container managers.
Special channel for failures
Provide the means to delegate errors as messages on their own channel or service. Techniques include log aggregation and shared monitoring services.
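One common way for an application to expect failure is to retry a failing call with exponential backoff. The sketch below is only an illustration of that idea: `call_with_retries` and `FlakyService` are hypothetical names, and the injected `sleep` parameter exists purely so the waiting can be observed without real delays.

```python
import time


def call_with_retries(operation, attempts=4, base_delay=0.01, sleep=time.sleep):
    """Call operation(); on ConnectionError wait base_delay * 2**n and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)


class FlakyService:
    """Hypothetical service that fails a fixed number of times, then works."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("node unavailable")
        return "ok"


delays = []  # records the backoff delays instead of actually sleeping
service = FlakyService(failures=2)
result = call_with_retries(service.fetch, sleep=delays.append)
```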
Operationally efficient
Using Containers for better Quality of Service
Containers
Containers run your application in isolation in a portable
and repeatable fashion
“Because of the way that ... containers separate the
application constraints from infrastructure concerns, we
help solve that dependency hell.” - Docker
Why Containers
Containers Are:
• Small (footprint)
• Portable
• Fast
Containers Allow:
• Resiliency
  • Can be redeployed in seconds
• Operational efficiency
  • The infrastructure can “heal”
• Scalability
  • On a single server, resources (such as CPU) can be dialed up or down
Containers vs. VMs
Virtual Machines
Emulate virtual hardware and require considerable overhead in CPU, disk and memory.
Containers
Share the host operating system, which is much more efficient than a hypervisor in system resource terms.
Containers
But which container to use?
Many choices exist…
LXC, Docker, Rocket
Docker - our container of choice
Why Docker?
• Best of breed at the moment
• Integrates natively with Mesos and Kubernetes
• Has great infrastructure, including
  • Docker Registry for looking up containers
  • Docker Compose for combining containers
Alternatives
• Rocket - More secure as it uses init.d, but Linux only
Container Standardisation
But choosing a container is not a commitment...
Standardisation looks to be around the corner via the Open Container Initiative:
https://www.opencontainers.org/
Initiative Sponsors: Apcera, AT&T, AWS, Cisco, ClusterHQ, CoreOS, Datera, Docker, EMC, Fujitsu, Google, Goldman Sachs,
HP, Huawei, IBM, Intel, Joyent, Kismatic, Kyup, the Linux Foundation, Mesosphere, Microsoft, Midokura, Nutanix, Oracle,
Pivotal, Polyverse, Rancher, Red Hat, Resin.io, Suse, Sysdig, Twitter, Verizon, VMWare
Recap: Containers
Containers Are:
• Small (footprint), Portable, Fast
• Allow you to repeatedly deploy applications
• Work well with schedulers such as Mesos
• Help with Resiliency and scalability
Putting it all together
Insight's Linked Data infrastructure
Linked Data Infrastructure
[Figure: architecture diagram. Three Linux servers each run a Mesos client and Docker. Mesos resource management sits on top of them, with an OS monitor and a Mesos monitor alongside. Above that, the Mesos scheduler for short jobs hosts the Spark and Chronos frameworks, and the Mesos scheduler for long-running jobs hosts the Marathon framework. The top layer shows datastores and jobs: HDT, Neo4J, GraphX, Granatum and Revealed.]
• Resources (CPU, memory, disk) are managed by Mesos
• Applications work with frameworks to get the resources they need
• Frameworks negotiate with Mesos to run their jobs
• Docker manages isolation on the Linux servers
Linked Data Infrastructure
[Figure: the same architecture diagram, with Granatum and Revealed shown on top, annotated as follows]
• We use GraphX for large graph batch jobs
• We use both HDT (RDF store) and Neo4J (graph) as datastores
• We deploy specialised linked data applications to the cluster
Recap
To provide linked data at scale and with the right service mix, infrastructures need to consider:
• Services that help applications to be distributed and scalable
• High utilisation of computing resources
• How failure will be handled
• Operational efficiency
We suggest using schedulers such as Mesos with containers (Docker), together with suitable frameworks (GraphX/Spark) and datastores (Neo4J).
Data Curation
Ali Hasnain, Narumol Prangnawarat, Naoise Dunne
Intro
We will discuss:
•Serialisation formats for RDF
•Converting between these…
•Mapping to conventional data
• D2RQ
• TARQL
Serialisation Formats
RDF Serialization formats - W3C Standards
• Turtle
  • A compact, human-friendly format.
• N-Triples, N-Quads
  • A simple, easy-to-parse, line-based format that is not as compact as Turtle. N-Quads is a superset of N-Triples for multiple RDF graphs.
• JSON-LD
  • A JSON-based serialization.
• RDF/XML
  • The first standard format for serializing RDF.
N-Triples: verbose and hard to read
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/creator> "Dave Beckett" .
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/creator> "Art Barstow" .
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/publisher> <http://www.w3.org/> .
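The "easy-to-parse, line-based" property of N-Triples can be illustrated with a short Python sketch. It is deliberately incomplete and is not a conforming parser: the regular expression only covers IRI subjects and predicates and objects that are IRIs or plain string literals, which is all the example above uses. `parse_ntriples` is a hypothetical helper name.

```python
import re

# Handles only the simple cases: <IRI> <IRI> followed by <IRI> or "literal".
# Blank nodes, language tags, datatypes and escapes are not supported.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.$')


def parse_ntriples(text):
    """Yield (subject, predicate, object) tuples, one per non-empty line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = TRIPLE.match(line)
        if m is None:
            raise ValueError(f"unsupported line: {line!r}")
        s, p, iri, literal = m.groups()
        obj = iri if iri is not None else literal
        yield s, p, obj


DOC = '''
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/creator> "Dave Beckett" .
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/publisher> <http://www.w3.org/> .
'''
triples = list(parse_ntriples(DOC))
```

Each line stands alone, so a large file can be split at line boundaries and parsed in parallel, at the cost of repeating full IRIs on every line.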
RDF Serialization formats - Non Standard
Non-standard but popular formats
• N3 or Notation3
  • A non-standard serialization that is very similar to Turtle, but has some additional features, such as the ability to define inference rules
• HDT
  • Compressed binary RDF. HDT compresses big RDF datasets while maintaining search and browse operations
• Microformats
  • Similar to RDF, and can be converted to RDF. Use HTML pages as both a human-readable document and machine-readable data; very widely used on the web
Converting between standards
Any23
• Created at DERI (Insight) for converting between popular serialisation formats.
• Now a top-level Apache project
• Online converter: http://any23-vm.apache.org/
Mapping traditional data to Linked Data
Two very popular tools created at Insight (DERI)
D2RQ - http://d2rq.org/
Maps relational databases to RDF
TARQL
Maps tables (CSV) to RDF
Why Tarql
Very simple mapping syntax.
• Most of the world’s structured data is stored as tables
• Most RDBMS database tables can be denormalized to a single table
• Data cleansing can be an earlier step and can make use of best-in-class tools for tabular data
• Compared to tools such as D2RQ, very easy to learn and use
TARQL
General structure of a TARQL mapping:
• A normal SPARQL SELECT, but with a FROM clause that names the input file
• Works with both SELECT and CONSTRUCT queries
SELECT DISTINCT ?id ?name
FROM <file:filename.csv>
WHERE {}
LIMIT 100
TARQL - Select
Looks very like normal SPARQL. In the WHERE clause:
• Special BIND statements bind column names to graph constructs
SELECT ...
WHERE {
  BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
  BIND (STRLANG(?a, 'en') AS ?with_language_tag)
}
TARQL - Construct
Looks very like normal SPARQL. In the WHERE clause:
• Special BIND statements bind column names to graph constructs
CONSTRUCT {
  ?URI a ex:Organization;
    ex:name ?NameWithLang;
    ex:CIK ?CIK;
    ex:LEI ?LEI;
    ex:ticker ?Stock_ticker;
}
FROM <file:companies.csv>
WHERE {
  BIND (URI(CONCAT('companies/', ?Stock_ticker)) AS ?URI)
  BIND (STRLANG(?Name, "en") AS ?NameWithLang)
}
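What the CONSTRUCT mapping does to each CSV row can be mimicked in plain Python. This is a sketch of the row-to-triples idea only, not of how TARQL is implemented: the sample rows are invented, the column names are taken from the variables in the mapping above, and `rows_to_triples` is a hypothetical helper.

```python
import csv
import io

# Invented sample standing in for companies.csv; the header row supplies
# the column names that the mapping refers to as variables.
CSV_TEXT = """Name,CIK,LEI,Stock_ticker
Acme Corp,0001,LEI001,ACME
Globex,0002,LEI002,GBX
"""


def rows_to_triples(csv_text, base="companies/"):
    """Emit (subject, predicate, object) tuples, one group per CSV row."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        uri = base + row["Stock_ticker"]            # mirrors BIND(URI(CONCAT(...)))
        name_with_lang = (row["Name"], "en")         # mirrors BIND(STRLANG(...))
        triples.append((uri, "rdf:type", "ex:Organization"))
        triples.append((uri, "ex:name", name_with_lang))
        triples.append((uri, "ex:CIK", row["CIK"]))
        triples.append((uri, "ex:LEI", row["LEI"]))
        triples.append((uri, "ex:ticker", row["Stock_ticker"]))
    return triples


company_triples = rows_to_triples(CSV_TEXT)
```

Every row yields the same five-triple template, which is why a single denormalized table maps so directly onto a CONSTRUCT query.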
Accessing Data - Query Federation
Ali Hasnain, Narumol Prangnawarat, Naoise Dunne
SPARQL Query Federation Approaches
• SPARQL Endpoint Federation (SEF)
• Linked Data Federation (LDF)
• Distributed Hash Tables (DHTs)
• Hybrid of SEF+LDF
Courtesy of Muhammad Saleem (AKSW)
SPARQL Endpoint Federation Approaches
• Most commonly used approaches
• Make use of SPARQL endpoint URLs
• Fast query execution
• RDF data needs to be exposed via SPARQL endpoints
• E.g., HiBISCuS, FedX, SPLENDID, ANAPSID, LHD etc.
Courtesy of Muhammad Saleem (AKSW)
Linked Data Federation Approaches
• Data need not be exposed via SPARQL endpoints
• Uses URI lookups at runtime
• Data should follow Linked Data principles
• Slower than the previous approaches
• E.g., LDQPS, SIHJoin, WoDQA etc.
Courtesy of Muhammad Saleem (AKSW)
Query federation on top of Distributed Hash Tables
• Uses DHT indexing to federate SPARQL queries
• Space efficient
• Cannot deal with the whole LOD cloud
• E.g., ATLAS
Courtesy of Muhammad Saleem (AKSW)
Hybrid of SEF+LDF
• Federation over SPARQL endpoints and Linked Data
• Can potentially deal with the whole LOD cloud
• E.g., ADERIS-Hybrid
Courtesy of Muhammad Saleem (AKSW)
SPARQL Endpoint Federation
1. Parsing/Rewriting: rewrite the query and get the individual triple patterns
2. Source Selection: identify capable sources for the individual triple patterns
3. Federator Optimizer: generate an optimized sub-query execution plan
4. Execute the sub-queries against the sources (S1 to S4, each an RDF dataset)
5. Integrator: integrate the sub-query results
Courtesy of Muhammad Saleem (AKSW)
FedBench (LD3): Return for all US presidents their party membership and news pages about them.
SELECT ?president ?party ?page
WHERE {
  ?president rdf:type dbpedia:President .                 # TP1
  ?president dbpedia:nationality dbpedia:United_States .  # TP2
  ?president dbpedia:party ?party .                       # TP3
  ?x nyt:topicPage ?page .                                # TP4
  ?x owl:sameAs ?president .                              # TP5
}
Sources (all RDF): S1 DBpedia, S2 KEGG, S3 ChEBI, S4 NYT, S5 SWDF, S6 LMDB, S7 Jamendo, S8 GeoNames, S9 DrugBank
Triple pattern-wise source selection (output of the source selection algorithm):
TP1 = S1
TP2 = S1
TP3 = S1
TP4 = S4
TP5 = S1, S2, S4-S9
Total triple pattern-wise sources selected = 1+1+1+1+8 = 12
Courtesy of Muhammad Saleem (AKSW)
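Triple pattern-wise source selection can be sketched with an ASK-style check against each source. The sketch below makes several assumptions: the sources are tiny in-memory sets of invented triples rather than remote SPARQL endpoints, and `ask` and `select_sources` are hypothetical helpers, not functions of any federation engine named in these slides.

```python
def is_var(term):
    """SPARQL-style variables start with a question mark."""
    return term.startswith("?")


def ask(dataset, pattern):
    """ASK-style check: does any triple in the dataset match the pattern?"""
    return any(
        all(is_var(p) or p == t for p, t in zip(pattern, triple))
        for triple in dataset
    )


def select_sources(sources, patterns):
    """Map each triple pattern to the sorted names of capable sources."""
    return {
        pattern: sorted(name for name, data in sources.items() if ask(data, pattern))
        for pattern in patterns
    }


# Invented toy data standing in for three remote endpoints.
SOURCES = {
    "dbpedia": {
        ("dbpedia:Barack_Obama", "rdf:type", "dbpedia:President"),
        ("dbpedia:Barack_Obama", "dbpedia:party", "dbpedia:Democratic_Party"),
        ("dbpedia:Barack_Obama", "owl:sameAs", "ex:obama"),
    },
    "nyt": {
        ("nyt:obama", "nyt:topicPage", "http://example.org/obama"),
        ("nyt:obama", "owl:sameAs", "dbpedia:Barack_Obama"),
    },
    "kegg": {
        ("kegg:C00001", "rdf:type", "kegg:Compound"),
    },
}

PATTERNS = [
    ("?president", "rdf:type", "dbpedia:President"),
    ("?president", "dbpedia:party", "?party"),
    ("?x", "nyt:topicPage", "?page"),
    ("?x", "owl:sameAs", "?president"),
]

selection = select_sources(SOURCES, PATTERNS)
total = sum(len(names) for names in selection.values())
```

As on the slide, a generic predicate such as owl:sameAs selects more sources than a source-specific one such as nyt:topicPage, and every extra source means extra sub-queries at execution time.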
SPARQL Query Federation Engines
• FedX
• SPLENDID
• HiBISCuS+FedX
• HiBISCuS+SPLENDID
• ANAPSID
• BioFed
• LHD
• DARQ
Courtesy of Muhammad Saleem (AKSW)
Overview of Implementation Details of Federated SPARQL Query Engines
System Features of Federated SPARQL Query Engines
Systems' Support for SPARQL Query Constructs
Analysis of Linked Data at Scale
Narumol Prangnawarat, Ali Hasnain, Naoise Dunne
Linked Data
A method of publishing structured data so that it can be interlinked.
• Normally represented as RDF
• Normally queried using SPARQL
4 principles of linked data
1. Use URIs to name (identify) things.
2. Use HTTP URIs so that these names can be looked up.
3. Provide useful information about the thing when its URI is looked up.
  - Use open standards such as RDF, SPARQL, etc.
4. Link to other things using their HTTP URI-based names.
How do I query Linked Data?
Linked data as a graph
Many approaches to querying and reasoning over linked data exist.
The most popular query language in the RDF community is SPARQL, but alternatives exist…
… if we think about linked data as a graph
Linked data as a graph
As Linked data graphs share graph structure and can be
connected we can reason over them using graph
algorithms.
Popular approaches to querying graphs are:
• Declarative pattern matching - SPARQL
• graph traversal languages - Cypher, Gremlin
• distributed graph data structures - GraphX
Comparing Graph Query
Approaches
Comparing Graph Query Approaches
Cypher - graph traversal
Somewhere between graph traversal and pattern matching; runs on Neo4J
Characteristics
• Property graph
• Index-free adjacency
• Uses adjacency tables
Pros
• Great at localized searches (shortest path, for instance)
• Is a database
• Is an expressive language
Cons
• Poor at aggregation
GraphX - message passing
A DSL rather than a language, aimed at “programmers”, with Scala, Java and Python APIs.
Characteristics
• Resilient Distributed Property Graph
• Index-free adjacency
• Vertex/Edge table
Pros
• Best (in this list) distributed execution
• Powerful mix of table and traversal
• Has a set of optimised operators
Cons
• Not a database; data needs to be loaded and stored separately
SPARQL - declarative
A declarative language, allowing better expression and abstracting the writer from optimisation problems
Characteristics
• Stores triples (tuples)
• Vendors normally optimise popular queries
• Often built on an RDBMS
Pros
• Is an expressive language
• Easy to express search patterns
Cons
• Difficult to scale
• Poor at traversal
Comparing Graph Query Approaches
Cypher - graph traversal
Somewhere between graph
traversal and pattern matching runs
on Neo4J
Graph X - message passing
DSL rather than language. Scala,
Java and Python APIs.
SPARQL -declarative
is a declarative language, allowing
better expression and abstracting the
writer from optimisation problems
Axes of comparison:
• Abstract vs. concrete
• Considerable work (for vendors) to scale vs. little work to scale
• Optimised for aggregation vs. optimised for connections
• Global vs. local
Other Graph Query Languages
Cypher - graph traversal
Somewhere between graph traversal and pattern matching; runs on Neo4J
Alternatives
• Gremlin: a “pure” graph traversal language based on XPath
• NetworkX: a Python DSL for graphs
GraphX - message passing
A DSL rather than a language, aimed at “programmers”, with Scala, Java and Python APIs.
Alternatives
• GraphLab: a very powerful commercial product
• Giraph: a Hadoop API for graphs
SPARQL - declarative
A declarative language, allowing better expression and abstracting the writer from optimisation problems
Alternatives
• Cypher(!): has pattern matching constructs and can do much the same as SPARQL on a Neo4J database
SPARQL at big data scale
What about using SPARQL in a distributed big data infrastructure?
SPARQL at big data scale
At big data scales and in distributed infrastructure, SPARQL quickly becomes an impediment.
Why?
• It is difficult to optimise SPARQL at scale with fast data
• As optimisations are embedded in the query, optimised SPARQL queries become less natural and harder to write
• SPARQL abstractions “leak”, leading to “hacks” of big data RDF infrastructure
Why use graph algorithms
Why not SPARQL
• For distributed computing, declarative languages such as SPARQL are (for now) problematic
• You cannot know in advance whether a query will run in NP or EXP time
• Query plans are difficult to create, especially with fast or changing graphs
• You have to rely on federation, which is still being researched
Why graph traversal
• Proven to scale
• Graph traversal is lower level but easier to tune so that it runs in P time
• Most popular algorithms are local, that is, they only query neighbouring nodes at any time, and are thus easier to break up across compute nodes
Scaling SPARQL
So you still want to use SPARQL at scale?
How do we query RDF data with SPARQL at big data scale?
Clustering
• Some vendors/platforms provide clustered triple stores
Federation
• Became available in SPARQL 1.1 with the SERVICE keyword
• Federation emerged with great promises
• For distributed computing, SPARQL federation has limitations (for now)
• Query planning for SPARQL is NP-complete
Alternatives to Scaling Sparql
Fortunately, both graph traversal and GraphX offer approaches to linked data that work at scale today.
We will look at these approaches now.
Analysing Linked Data as a Graph
Querying linked data at scale with technologies complementary to SPARQL
Linked data as a graph
Linked data can be considered a heterogeneous graph.
What do we mean by a heterogeneous graph?
• Linked data graphs share graph structure but have mixed characteristics
• Nodes (vertices) contain different data
• Links (edges) are a mix of directed and undirected
• A linked data graph can be weighted or not
• Can have a mix of classes and types from differing ontologies
What do these graphs look like?
[Figure: graphs within a heterogeneous graph]
Linked data as a graph
As linked data graphs share graph structure and can be connected, we can reason over them using graph algorithms.
Popular approaches are:
• Graph traversal languages - Cypher, Gremlin
• Distributed graph data structures - GraphX
• Declarative pattern matching - SPARQL, but it can be hard to scale
Cypher graph language
Some basic cypher...
Find the actor named "Tom Hanks"...
MATCH (tom {name: "Tom Hanks"}) RETURN tom
Who directed "Cloud Atlas"?
MATCH (cloudAtlas {title: "Cloud Atlas"})<-[:DIRECTED]-(directors)
RETURN directors.name
Shortest Path
Shortest path is the problem of finding a path between
two vertices (or nodes) in a graph with the lowest weight
(this could be cost or distance).
What is Dijkstra
An approach to shortest path, using traversal, that runs in polynomial time on graphs with non-negative edge weights.
A naive approach that computes shortest paths between all pairs of vertices has a complexity of:
O(|V|³)
With Dijkstra's approach (using a Fibonacci-heap priority queue) we get a worst-case complexity of:
O(|E| + |V| log |V|)
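A compact Python sketch of Dijkstra's algorithm follows. It uses the standard library's binary heap (`heapq`), so its bound is O((|V| + |E|) log |V|) rather than the Fibonacci-heap bound; the example graph and its weights are invented for illustration.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source; graph maps node -> {neighbour: weight}.

    Edge weights must be non-negative.
    """
    dist = {source: 0}
    visited = set()
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue  # stale queue entry for an already settled node
        visited.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist


# Invented undirected example graph with positive weights.
GRAPH = {
    "a": {"b": 7, "c": 9, "f": 14},
    "b": {"a": 7, "c": 10, "d": 15},
    "c": {"a": 9, "b": 10, "d": 11, "f": 2},
    "d": {"b": 15, "c": 11, "e": 6},
    "e": {"d": 6, "f": 9},
    "f": {"a": 14, "c": 2, "e": 9},
}
distances = dijkstra(GRAPH, "a")
```

Each step only looks at the neighbours of the node just settled, which is the locality property that makes traversal-style algorithms attractive for distributed graphs.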
Dijkstra using Cypher
MATCH p = shortestPath(
  (bacon:Person {name:"Kevin Bacon"})-[*]-
  (meg:Person {name:"Meg Ryan"}))
RETURN p
Find the shortest path (fewest relationships) between the Person named Kevin Bacon and the Person named Meg Ryan
DEMO
[Figure: Cypher shortest path result]
Another useful algorithm: community detection
What is community detection?
A graph is said to have community structure if the nodes of the network can be easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is densely connected internally.
Visualisation tools for large graphs
On larger datasets, Neo4j's UI becomes unresponsive for large result sets from queries such as community detection.
• Instead, you may want to use a visualisation tool such as Gephi.
• We will now demonstrate Louvain community detection using the graph visualisation tool Gephi.
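Louvain works by searching for a partition with high modularity, a score of how much denser the connections inside communities are than would be expected by chance. The sketch below computes only that score for a given partition of an undirected, unweighted graph; it is not the Louvain algorithm itself, and the graph and community labels are invented.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 m))**2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    internal = defaultdict(int)
    degree = defaultdict(int)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "left", 1: "left", 2: "left", 3: "right", 4: "right", 5: "right"}
one_group = {n: "all" for n in range(6)}

q_split = modularity(EDGES, two_groups)   # one community per triangle
q_merged = modularity(EDGES, one_group)   # everything in one community
```

Splitting the graph into its two triangles scores higher than lumping all nodes together, which matches the intuition of densely connected groups.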
Using Gephi
Here is a graph that we calculated using Louvain earlier...
This example shows clustering around topics discussed on Twitter:
• We cluster tweets around the topic under discussion
• Each retweet or reply creates a link in the graph
[Figure: Gephi screenshot]
Once the Louvain analysis is complete, we load the results into Gephi and gain new insights by visualising the clusters around these topics:
• Compared to text analysis, we are able to detect deeper community structures from the retweets and replies to tweets
• This gives us a deeper understanding of the individuals and communities
Sparql Federation example
Q: How are the protein targets of the
gleevec drug differentially expressed,
which pathways are they involved in?
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX chembl_molecule: <http://rdf.ebi.ac.uk/resource/chembl/molecule/>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?dbXref (str(?pathwayname) AS ?pathname) ?factorLabel
WHERE {
  # query ChEMBL for Gleevec (CHEMBL941) protein targets
  ?act a cco:Activity ;
       cco:hasMolecule chembl_molecule:CHEMBL941 ;
       cco:hasAssay ?assay .
  ?assay cco:hasTarget ?target .
  ?target cco:hasTargetComponent ?targetcmpt .
  ?targetcmpt cco:targetCmptXref ?dbXref .
  # the taxonomy value is missing from the slide text; a variable is used here so the query parses
  ?targetcmpt cco:taxonomy ?taxonomy .
  ?dbXref a cco:UniprotRef .
  # query for pathways by those protein targets
  SERVICE <http://www.ebi.ac.uk/rdf/services/reactome/sparql> {
    ?protein rdf:type biopax3:Protein .
    ?protein biopax3:memberPhysicalEntity
             [biopax3:entityReference ?dbXref] .
    ?pathway biopax3:displayName ?pathwayname .
    ?pathway biopax3:pathwayComponent ?reaction .
    ?reaction ?rel ?protein .
  }
  # get Atlas experiment plus experimental factor where protein is expressed
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?probe atlasterms:dbXref ?dbXref .
    ?value atlasterms:isMeasurementOf ?probe .
    ?value atlasterms:hasFactorValue ?factor .
    ?value rdfs:label ?factorLabel .
  }
}
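A query like the one above is normally sent to an endpoint over the SPARQL Protocol, and the answer comes back in the W3C SPARQL JSON results layout. The sketch below shows both halves with the Python standard library only. The endpoint URL is a placeholder (the slides do not say where the outer query is sent), the short query stands in for the full federated one, and the sample response is hand-written, not real data.

```python
import json
import urllib.parse
import urllib.request

# Placeholder endpoint: an assumption, not an address given in the slides.
ENDPOINT = "https://example.org/sparql"


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_results(results_json):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    data = json.loads(results_json)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]


# A short stand-in query; the federated query text would be passed the same way.
query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"
request = build_request(ENDPOINT, query)
print(request.full_url)

# Hand-written response in the SPARQL JSON results layout (illustrative only).
sample = """
{"head": {"vars": ["dbXref", "pathname"]},
 "results": {"bindings": [
   {"dbXref": {"type": "uri", "value": "http://example.org/protein/1"},
    "pathname": {"type": "literal", "value": "Example pathway"}}
 ]}}
"""
print(rows_from_results(sample))

# To execute against a live endpoint, one would call:
#   urllib.request.urlopen(request).read().decode("utf-8")
```

The SERVICE clauses are resolved by the endpoint that receives the query, so the client only ever talks to one URL.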
Why use graph algorithms
Polynomial time
Many queries on linked data can be expressed as well-known graph traversal algorithms that run in
polynomial (P) time, such as Dijkstra's algorithm for directed graphs with non-negative edge weights.
Because these algorithms work on local parts of the graph, they are well suited to distributed computing.
A purer Dijkstra using Cypher
MATCH (from:Location {LocationName: "x"}), (to:Location {LocationName: "y"}),
  paths = allShortestPaths((from)-[:CONNECTED_TO*]->(to))
WITH REDUCE(dist = 0, rel IN rels(paths) | dist + rel.distance) AS distance, paths
RETURN paths, distance ORDER BY distance LIMIT 1
The other approach used a built-in function; this one is a closer
approximation of the actual algorithm. Note that allShortestPaths only returns
paths with the fewest hops, so the query picks the lowest total distance among those paths.
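To make the two steps of the query above concrete, here is a small Python sketch that mirrors them: first collect every path with the fewest hops, then keep the one with the lowest summed distance. The graph, node names and helper functions are invented for illustration.

```python
from collections import deque


def all_fewest_hop_paths(graph, start, goal):
    """Return every path from start to goal that uses the minimum number of edges.

    graph: dict mapping node -> {neighbour: distance} for directed edges.
    """
    # Breadth-first search recording each node's hop count and every
    # predecessor that reaches it in exactly that many hops.
    hops = {start: 0}
    preds = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, {}):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                preds[nxt] = [node]
                queue.append(nxt)
            elif hops[nxt] == hops[node] + 1:
                preds[nxt].append(node)
    if goal not in hops:
        return []

    def expand(node):
        # Rebuild paths by walking predecessor links back to the start.
        if node == start:
            return [[start]]
        return [path + [node] for p in preds[node] for path in expand(p)]

    return expand(goal)


def lowest_distance_among(graph, paths):
    """Pick the (total_distance, path) pair with the smallest total, or None."""
    def total(path):
        return sum(graph[a][b] for a, b in zip(path, path[1:]))
    return min(((total(p), p) for p in paths), default=None)


# Hypothetical graph: the three-hop route x-a-b-y is cheapest overall (3),
# but only the two-hop routes are considered.
graph = {
    "x": {"a": 1, "c": 5, "d": 2},
    "a": {"b": 1},
    "b": {"y": 1},
    "c": {"y": 5},
    "d": {"y": 9},
}
candidates = all_fewest_hop_paths(graph, "x", "y")
print(candidates)
print(lowest_distance_among(graph, candidates))
```

On this toy graph the two candidates are x-c-y and x-d-y, so the three-hop route with total distance 3 is never examined.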
Dijkstra Pseudocode
dist[s] ← 0                        (distance to source vertex is zero)
for all v ∈ V − {s}
    do dist[v] ← ∞                 (set all other distances to infinity)
S ← ∅                              (S, the set of visited vertices, is initially empty)
Q ← V                              (Q, the queue, initially contains all vertices)
while Q ≠ ∅                        (while the queue is not empty)
    do u ← mindistance(Q, dist)    (select and remove the element of Q with the min. distance)
       S ← S ∪ {u}                 (add u to the set of visited vertices)
       for all v ∈ neighbors[u]
           do if dist[v] > dist[u] + w(u, v)        (if new shortest path found)
                 then dist[v] ← dist[u] + w(u, v)   (set new value of shortest path)
                 (if desired, add trace back code)
return dist
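The pseudocode above can be sketched in Python as follows. This version uses a binary heap (`heapq`) with lazy deletion in place of a queue Q that initially holds every vertex, which is a common variation rather than a line-by-line transcription. The graph and node names are made up for illustration.

```python
import heapq
import math


def dijkstra(graph, source):
    """Shortest distances from source to every node.

    graph: dict mapping node -> {neighbour: weight}, with weights >= 0.
    """
    # Collect all nodes, including those that only appear as neighbours.
    nodes = set(graph)
    for neighbours in graph.values():
        nodes.update(neighbours)

    dist = {node: math.inf for node in nodes}
    dist[source] = 0
    visited = set()
    queue = [(0, source)]  # heap of (tentative distance, node)

    while queue:
        d, u = heapq.heappop(queue)
        if u in visited:
            continue  # stale heap entry left over from an earlier relaxation
        visited.add(u)
        for v, w in graph.get(u, {}).items():
            if dist[v] > d + w:  # a shorter path to v was found
                dist[v] = d + w
                heapq.heappush(queue, (dist[v], v))
    return dist


# Hypothetical weighted directed graph.
graph = {
    "s": {"a": 4, "b": 1},
    "b": {"a": 2, "c": 5},
    "a": {"c": 1},
    "c": {},
}
print(dijkstra(graph, "s"))
```

Here the direct edge s→a costs 4, but the route through b costs 3, so the relaxation step lowers `dist["a"]` before a is visited.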
Visualization
Ali Hasnain, Stefan Decker, Naoise Dunne
Visualization
• Visualize your Data!
• Available Tools
• ReVeaLD
• FedViz
• Genome Wheel
ReVeaLD Search Platform
ReVeaLD (Real-Time Visual Explorer and Aggregator of Linked Data) is a user-driven,
domain-specific search platform.
• Intuitively formulate advanced search queries using a click-input-select
mechanism
• Visualize the results in a domain-suitable format
Assembly of the query is governed by a Domain Specific Language (DSL),
which in this case is the Granatum Biomedical Semantic Model (CanCO).
ReVeaLD Search Platform
Availability: http://n10.soma.insight-centre.org:31005/explorer
Demo: https://www.youtube.com/watch?v=6HHK4ASIkJM&hd=1
Courtesy: Maulik Kamdar
DSL Visual Representation
Concept Map Visualization is used.
Visual Query Builder
Visual Query Model
FedViz: A Visual Interface for SPARQL Queries
Formulation and Execution
FedViz is an online application that provides biologists with a flexible visual interface
to formulate and execute both federated and non-federated SPARQL queries.
It translates the visually assembled queries into their SPARQL equivalent and
executes them using a query engine (FedX).
Availability: http://srvgal86.deri.ie/FedViz/index.html
Courtesy: Sana e Zainab
Using FedViz: Step by Step
GenomeSnip Platform
A semantic visual analytics prototype devised to expedite knowledge exploration and discovery in
cancer research.
Idea: 'Snip' the human genome informatively into fragments through interaction with an aggregative,
circular visualization, the 'Genomic Wheel', and introspectively analyze the snipped fragments in a
linear 'Genomic Tracks' display.
Technologies: Web-based client application developed using native web technologies such as HTML5
Canvas, JavaScript and JSON.
The KineticJS library, an HTML5 Canvas JavaScript framework, is used for node nesting, layering,
caching and event handling.
Availability: http://srvgal78.deri.ie/genomeSnip/
Courtesy: Maulik Kamdar
Genome Browser
Hands On Session
Narumol Prangnawarat, Ali Hasnain, Naoise Dunne
Convert CSV File to RDF using TARQL
Instructions Manual
https://goo.gl/xLpF8Y
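Tarql itself expresses the mapping as a SPARQL CONSTRUCT query over the CSV columns, and the instruction manual linked above covers the actual tool. Purely to illustrate the underlying idea of turning rows into triples, here is a minimal Python sketch that does not use Tarql at all. The namespace, column names and sample rows are invented.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not from the slides


def escape_literal(text):
    """Escape the characters that must not appear raw in an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def csv_to_ntriples(csv_file, id_column):
    """Yield one N-Triples line per non-empty cell; id_column names the subject."""
    for row in csv.DictReader(csv_file):
        subject = f"<{BASE}resource/{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            # Skip the subject column, empty cells and surplus unnamed fields.
            if column is None or column == id_column or not value:
                continue
            predicate = f"<{BASE}property/{quote(column, safe='')}>"
            yield f'{subject} {predicate} "{escape_literal(value)}" .'


# Hypothetical CSV content; the second row has an empty "city" cell.
sample = io.StringIO("id,name,city\n1,Ada,Galway\n2,Bob,\n")
for line in csv_to_ntriples(sample, "id"):
    print(line)
```

Each CSV row becomes one subject, each column header one predicate, and each non-empty cell one plain literal.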

 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfYashikaSharma391629
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 

Dernier (20)

MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 

Knowledge Processing with Big Data and Semantic Web Technologies

  • 1. © Insight 2014. All Rights Reserved Knowledge Processing with Big Data and Semantic Web Technologies Ali Hasnain, Narumol Prangnawarat, Stefan Decker, Naoise Dunne
  • 2. Presenters and Contributors: Ali Hasnain, Stefan Decker, Narumol Prangnawarat, Naoise Dunne
  • 3. Agenda • Motivation • Infrastructure • Data Curation • Query Federation • Analyze • Visualization • Hands On Session
  • 4. Session 0: Motivation
  • 5. The Web is evolving... WWW (Tim Berners-Lee): “There was a second part of the dream […] we could then use computers to help us analyse it, make sense of what we're doing, where we individually fit in, and how we can better work together.”
  • 6. A Network of Knowledge ● Interconnected ● Universal ● All encompassing ● assists humans, organisations and systems with problem solving ● enabling innovation and increased productivity
  • 7. Two Key Ingredients 1. RDF – Resource Description Framework: graph-based data (nodes and arcs) • Identifies objects (URIs) • Interlinks information (relationships) 2. Vocabularies (Ontologies) • provide shared understanding of a domain • organise knowledge in a machine-comprehensible way • give an exploitable meaning to the data
  • 10. Why Graphs? [Figure: two databases described as graphs that share the node TGFβ-3. Gene Database: TGFβ-3 nci:has_description "transforming growth factor, beta 3"; nih:organism "Homo sapiens"; nih:sequence "CCCCGGCGCAGCGCGGCCGCAGCAGCCTCCGCCCCCCGCACGGTGTGAGCGCCCGACGCGGCCGAGGCGG…"; nih:location "14q24". Pathway Database: TGFβ-3 rea:process "Platelet activation, signalling, aggregation"; rea:process "Response to elevated platelet cytosol Ca2+"; rea:process "Platelet degranulation".]
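The point of the "Why Graphs?" figure can be sketched in a few lines of Python: when two sources describe the same node with triples, combining them is a set union. This is a minimal illustration, not the presenters' code. The triples are copied (and shortened) from the figure, while the `describe` helper and the use of plain strings instead of URIs are assumptions made for brevity.

```python
# Toy triples taken from the "Why Graphs?" figure; plain strings stand in for URIs.
GENE_DB = {
    ("TGFβ-3", "nci:has_description", "transforming growth factor, beta 3"),
    ("TGFβ-3", "nih:organism", "Homo sapiens"),
    ("TGFβ-3", "nih:location", "14q24"),
}
PATHWAY_DB = {
    ("TGFβ-3", "rea:process", "Platelet degranulation"),
    ("TGFβ-3", "rea:process", "Response to elevated platelet cytosol Ca2+"),
}


def describe(graph, subject):
    """Group every (predicate, object) pair of one subject by predicate."""
    out = {}
    for s, p, o in graph:
        if s == subject:
            out.setdefault(p, set()).add(o)
    return out


# Both sources use the same identifier for the gene, so a set union links them.
merged = GENE_DB | PATHWAY_DB
profile = describe(merged, "TGFβ-3")
```

After the union, `profile` holds the gene's location from one database next to its pathway processes from the other, which is the cross-database view the figure illustrates.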
  • 12. Linked Open Data Cloud
  • 13. Life Sciences….
  • 14. Cultural Institutions...
  • 15. Open Government Data...
  • 16. Legacy Data Sources….
  • 17. How to analyse these data? Heterogeneity in • Data: different formats • Domains: how to cross discipline borders? • Users: Life Science data needs different analysis and visualisation than cultural data. An analysis tool for each domain?
  • 18. Networked Data Management; Abstraction, Reasoning, Analytics; Visualisation, Collaboration, Exploitation. Reusable Infrastructure: Knowledge Pipeline
  • 19. Challenges for a Knowledge Pipeline • How to ingest data sets • How to automate and scale processing • How to realise large-scale Linked Data processing • How to analyse large data sets • How to visualise large data sets • How to combine different components
  • 20. Session 1: Infrastructure. Trends in big linked data infrastructure
  • 21. Introduction: Cloud computing frameworks tailored for managing and analyzing big data sets are powering ever larger clusters of computers. This presentation describes the infrastructure that is required to serve a linked data flavour of big data.
  • 22. What is Big Data? “Big Data is characterized by High Volume, Velocity and Variety requiring specific Technology and Analytical Methods for its transformation into Value” - Gartner • Volume - data is too big to fit on even the largest server • Velocity - data needs to be handled according to its speed • Variety - heterogeneous data that comes in many forms • Veracity (sometimes) - quality of the data
  • 23. Infographic summarizing the 4 Vs
  • 24. The qualities of a big data infrastructure • Distributed • Data and its processing are shared across a large cluster of cheap commodity servers • Loose coupling, isolation, location transparency, data locality and app-level composition • High Utilisation • The infrastructure makes the best use of computing resources • Resilience (handling failure) • The infrastructure stays responsive in the face of failure and can “heal” • Scalability • Grows to meet demand (elastic), responsive under varying workload (load balanced) • Operationally efficient • Needs to be highly automated and very easy to maintain
  • 25. Distributed, High Utilisation & Scalability: The Rise of Distributed Datacenter Schedulers
  • 26. Why Datacenter schedulers? Schedulers run your distributed apps • They are an operating system kernel for the cloud • Schedulers coordinate execution of work on the cluster • They help you get as many compute resources as you want, whenever you want them • They abstract some scalability and load balancing issues
  • 27. Benefits of using a Scheduler • Efficiency - best use of computing resources • Agility - change your application mix with no turnaround • Scalability - grow to the current demand of your app • Modularity - 2-level schedulers have plugin frameworks that allow quick repurposing of the core and no reliance on one vendor (more later)
  • 28. Datacenter schedulers: Schedulers help you focus on your own work and not the infrastructure. “It's great to be able to focus on what it is you want to be doing rather than worrying about how do you get what it is you need in order to be able to get stuff done” - John Wilkes (Google)
  • 29. Quick history of distributed schedulers [Timeline, 2003-2015: History of Datacenter Schedulers] 2003 Google File System; 2003 Slurm; 2004 MapReduce paper; 2004 Google Borg; 2005 Hadoop started; 2008 Hadoop released; 2010 Spark paper; 2010 Nexus (Mesos); 2011 Mesos paper; 2011 Hadoop 1.0; 2013 YARN; 2013 Mesos released; 2014 Kubernetes; 2014 Google Omega paper
  • 30. Hadoop. Monolithic scheduler: the original open-source datacenter scheduler • Jobs are batched and executed • Designed only to run MapReduce jobs • No concurrency between apps • Evolving into YARN [Figure: Hadoop resource management over two Linux servers; nodes labelled "mesos slave" run Linked Data M/R jobs and a Linked Data app]
  • 31. Mesos. 2-level scheduler: more flexible • Can schedule many kinds of applications • Frameworks (such as Spark) are delegated the per-application scheduling • Mesos is responsible for resource distribution between applications and for enforcing overall fairness • Very modular, due to 2-level scheduling: frameworks manage apps as they like [Figure: Mesos resource management and job scheduler over two Linux servers; the Chronos, Spark and Marathon frameworks run on Mesos slaves and launch a Hadoop M/R job, a Linked Data job and a Linked Data app]
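The two-level idea on the Mesos slide can be illustrated with a toy simulation. This is a sketch under stated assumptions and does not use the Mesos API: the `Framework` class, `on_offer` and `run_offers` are hypothetical names, CPUs are the only resource, and the fairness policy that Mesos enforces is replaced by a fixed offer order.

```python
from dataclasses import dataclass, field


@dataclass
class Framework:
    """Hypothetical second-level scheduler: decides what to do with an offer."""
    name: str
    cpus_per_task: int
    pending_tasks: int
    launched: list = field(default_factory=list)

    def on_offer(self, node, cpus):
        """Launch as many pending tasks as fit in the offer; return CPUs used."""
        n = min(self.pending_tasks, cpus // self.cpus_per_task)
        self.pending_tasks -= n
        self.launched += [(node, self.name)] * n
        return n * self.cpus_per_task


def run_offers(free_cpus, frameworks):
    """First level: offer each node's free CPUs to the frameworks in turn."""
    for node in sorted(free_cpus):
        for fw in frameworks:
            free_cpus[node] -= fw.on_offer(node, free_cpus[node])
    return free_cpus


spark = Framework("spark", cpus_per_task=2, pending_tasks=3)
marathon = Framework("marathon", cpus_per_task=1, pending_tasks=2)
left = run_offers({"node-a": 4, "node-b": 4}, [spark, marathon])
```

The first level only knows about free resources; each framework decides on its own how many tasks to place, which is the modularity argument made on the slide.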
  • 32. Schedulers Recap: Schedulers allow you to use your cluster as one machine • Ease operations • Provide elasticity and load balancing • Can run both batch and longer-running jobs • Are efficient, agile, scalable and modular
  • 33. Managing Failure in Infrastructure: “Everything fails all the time” - Werner Vogels (CTO, Amazon)
  • 34. Handling Failure. The harsh reality: all distributed infrastructures must deal with failure. For the designers of applications running on distributed infrastructure (even Mesos), a great number of design mistakes need to be avoided...
  • 35. Fallacies of distributed computing. The most common misconceptions that lead to failure of network infrastructure: • The network is reliable. • Latency is zero. • Bandwidth is infinite. • The network is secure. • Topology doesn't change. • There is one administrator. • Transport cost is zero. • The network is homogeneous. All prove to be false in the long run and all cause big trouble and painful learning experiences.
  • 36. How successful infrastructure avoids failure • No single point of failure: multiple masters, multiple copies of data, and redundancy on all services. Use elections between masters, distributed locks and thresholds. • Design applications to expect failure: applications should continue to function even if the underlying physical hardware fails. Evolving infrastructure eases this through the use of schedulers and container managers. • Special channel for failures: provides the means to delegate errors as messages on their own channel or service. Techniques include log aggregation and shared monitoring services.
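Two of the ideas above, expecting failure and reporting it on a separate channel, can be sketched as follows. This is an illustrative toy rather than a production pattern: the replica handlers, the `error_channel` queue and `call_with_failover` are all hypothetical names, and an in-process queue stands in for a real log aggregation or monitoring service.

```python
import queue

# Hypothetical "special channel" where failures are reported as messages.
error_channel = queue.Queue()


def call_with_failover(replicas, request):
    """Try each replica in turn; report failures instead of crashing."""
    for name, handler in replicas:
        try:
            return handler(request)
        except ConnectionError as exc:
            error_channel.put((name, str(exc)))
    raise RuntimeError("all replicas failed")


def broken(request):
    """Stand-in for a replica whose node is unreachable."""
    raise ConnectionError("node down")


def healthy(request):
    """Stand-in for a replica that answers normally."""
    return f"ok:{request}"


result = call_with_failover([("replica-1", broken), ("replica-2", healthy)], "ping")
```

The caller still gets an answer from the second replica, while the failure of the first one is recorded on the error channel for whoever monitors it.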
  • 37. Operationally efficient: Using Containers for Better Quality of Service
  • 38. Containers: Containers run your application in isolation in a portable and repeatable fashion. “Because of the way that ... containers separate the application constraints from infrastructure concerns, we help solve that dependency hell.” - Docker
  • 39. Why Containers. Containers are: • Small (footprint) • Portable • Fast. Containers allow: • Resiliency • can be redeployed in seconds • Operational efficiency • the infrastructure can “heal” • Scalability • on a single server, resources (such as CPU) can be dialed up or down
  • 40. Containers vs. VMs: Virtual machines emulate virtual hardware and require considerable overhead in CPU, disk and memory. Containers use a shared operating system and are much more efficient than hypervisors in system resource terms.
  • 41. Containers vs. VMs
  • 42. Containers: But which container to use? Many choices exist… LXC, Docker, Rocket
  • 43. Docker - our container of choice. Why Docker? • Best of breed at the moment • Integrates natively with Mesos and Kubernetes • Has great infrastructure, including • Docker Registry for looking up containers • Docker Compose for combining containers. Alternatives • Rocket - more secure as it uses init.d, but Linux only
  • 44. Container Standardisation: But choosing a container is not a commitment... Standardisation looks to be around the corner via the Open Container Initiative: https://www.opencontainers.org/ Initiative sponsors: Apcera, AT&T, AWS, Cisco, ClusterHQ, CoreOS, Datera, Docker, EMC, Fujitsu, Google, Goldman Sachs, HP, Huawei, IBM, Intel, Joyent, Kismatic, Kyup, the Linux Foundation, Mesosphere, Microsoft, Midokura, Nutanix, Oracle, Pivotal, Polyverse, Rancher, Red Hat, Resin.io, Suse, Sysdig, Twitter, Verizon, VMWare
  • 45. Recap: Containers. Containers: • Are small (footprint), portable and fast • Allow you to repeatedly deploy applications • Work well with schedulers such as Mesos • Help with resiliency and scalability
  • 46. Putting it all together: Insight's Linked Data infrastructure
  • 47. Linked Data Infrastructure [Figure: Mesos-based stack. Linux servers each run a Mesos client and Docker; Mesos resource management sits above them with schedulers for short jobs and long-running jobs; frameworks: Spark, Chronos, Marathon; monitoring: OS Monitor, Mesos Monitor; datastores: HDT, Neo4J; graph jobs: GraphX; applications: Granatum, Revealed.] Annotations: resources (CPU, memory, disk) are managed by Mesos; applications work with frameworks to get the resources they need; frameworks negotiate with Mesos to run their jobs; Docker manages isolation on the Linux servers.
  • 48. Linked Data Infrastructure [Figure: the same Mesos-based stack as on slide 47.] Annotations: we use GraphX for large graph batch jobs; we use both HDT (RDF store) and Neo4J (graph); we deploy specialised linked data applications (Granatum, Revealed) to the cluster.
  • 49. Recap: To provide linked data at scale and with the right service mix, infrastructures need to consider: • Services that help applications be distributed and scalable • High utilisation of computing resources • Knowing how you will handle failure • Operational efficiency. We suggest using schedulers such as Mesos with containers (Docker), suitable frameworks (GraphX/Spark) and datastores (Neo4J).
  • 50. Data Curation. Ali Hasnain, Narumol Prangnawarat, Naoise Dunne
  • 51. Intro. We will discuss: • Serialisation formats for RDF • Converting between these… • Mapping to conventional data • D2RQ • TARQL
  • 52. Serialisation Formats
  • 53. RDF Serialization formats - W3C Standards • Turtle: a compact, human-friendly format. • N-Triples, N-Quads: a simple, easy-to-parse, line-based format that is not as compact as Turtle. N-Quads is a superset of N-Triples for multiple RDF graphs. • JSON-LD: a JSON-based serialization. • RDF/XML: the first standard format for serializing RDF.
  • 54. N-Triples (unreadable): <http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/creator> "Dave Beckett" . <http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/creator> "Art Barstow" . <http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/publisher> <http://www.w3.org/> .
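Although N-Triples is hard on the eyes, its line-based layout makes it easy to process. The sketch below parses the three triples from the slide with a regular expression. It is deliberately incomplete: it only recognises `<IRI>` terms and plain quoted literals, does not check which term types are allowed in which position, and ignores blank nodes, language tags and datatypes, so it should be read as an illustration and not as a conforming parser. The names `parse_ntriples`, `TERM` and `LINE` are made up for this example.

```python
import re

# Matches only the two term shapes used on the slide: <IRI> and "plain literal".
TERM = r'(<[^>]*>|"(?:[^"\\]|\\.)*")'
LINE = re.compile(rf'^\s*{TERM}\s+{TERM}\s+{TERM}\s*\.\s*$')


def parse_ntriples(text):
    """Return (subject, predicate, object) tuples, skipping blanks and comments."""
    triples = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"not a supported N-Triples line: {line!r}")
        triples.append(match.groups())
    return triples


DOC = '''
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/creator> "Dave Beckett" .
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/creator> "Art Barstow" .
<http://www.w3.org/2001/sw/RDFCore/ntriples/> <http://purl.org/dc/elements/1.1/publisher> <http://www.w3.org/> .
'''
triples = parse_ntriples(DOC)
```

One triple per line with no prefixes or nesting is what makes the format verbose for people and simple for line-oriented tools.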
  • 55. RDF Serialization formats - Non-Standard. Non-standard but popular formats • N3 (Notation3): a non-standard serialization that is very similar to Turtle but has some additional features, such as the ability to define inference rules • HDT: compressed binary RDF; HDT compresses big RDF datasets while maintaining search and browse operations • Microformats: similar to RDF, and can be converted to RDF. They use HTML pages as both a human-readable document and machine-readable data; very big on the web
  • 56. Converting between standards: Any23 • Created at DERI (Insight) for converting between popular serialisation formats • Now a first-class Apache project • Online converter: http://any23-vm.apache.org/
  • 57. Mapping traditional data to Linked Data. Two very popular tools created at Insight (DERI): • D2RQ (http://d2rq.org/) maps relational databases to RDF • TARQL maps tables (CSV) to RDF
  • 58. Why TARQL: Very simple mapping syntax. • Most of the world’s structured data is stored as tables • Most RDBMS database tables can be denormalized to a single table • Data cleansing can be an earlier step and make use of best-in-class tools for tabular data • Compared to tools such as D2RQ, very easy to learn and use
  • 59. TARQL. General structure of a TARQL mapping: • A normal SPARQL query, but with a FROM clause that names the input file • Works with SELECT and CONSTRUCT queries. SELECT DISTINCT ?id ?name FROM <file:filename.csv> WHERE {} LIMIT 100
  • 60. TARQL - SELECT. Looks very like normal SPARQL; in the WHERE clause: • Special BIND statements that bind column names to graph constructs. SELECT ... WHERE { BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri) BIND (STRLANG(?a, 'en') AS ?with_language_tag) }
  • 61. TARQL - CONSTRUCT. Looks very like normal SPARQL; in the WHERE clause: • Special BIND statements that bind column names to graph constructs. CONSTRUCT { ?URI a ex:Organization; ex:name ?NameWithLang; ex:CIK ?CIK; ex:LEI ?LEI; ex:ticker ?Stock_ticker; } FROM <file:companies.csv> WHERE { BIND (URI(CONCAT('companies/', ?Stock_ticker)) AS ?URI) BIND (STRLANG(?Name, "en") AS ?NameWithLang) }
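To make the effect of the CONSTRUCT mapping concrete, the following Python sketch performs a comparable row-to-triples conversion with the standard `csv` module. It is not TARQL and does not call it. The expansion of the `ex:` prefix, the base IRI used to resolve `companies/...`, and the sample row are all assumptions, since the slide does not give them.

```python
import csv
import io

EX = "http://example.com/ns#"      # assumed expansion of the ex: prefix
BASE = "http://example.com/"       # assumed base IRI for 'companies/...'
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Made-up sample row; column names follow the variables in the CONSTRUCT query.
CSV_TEXT = "Name,Stock_ticker,CIK,LEI\nExample Corp,EXMP,0000000,LEI0000\n"


def literal(value, lang=None):
    """Render a plain or language-tagged N-Triples literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"' + (f"@{lang}" if lang else "")


def rows_to_ntriples(csv_text):
    """Emit one group of triples per CSV row, like the CONSTRUCT template."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Mirrors BIND (URI(CONCAT('companies/', ?Stock_ticker)) AS ?URI)
        uri = f"<{BASE}companies/{row['Stock_ticker']}>"
        lines.append(f"{uri} <{RDF_TYPE}> <{EX}Organization> .")
        # Mirrors BIND (STRLANG(?Name, "en") AS ?NameWithLang)
        lines.append(f"{uri} <{EX}name> {literal(row['Name'], 'en')} .")
        for column, prop in (("CIK", "CIK"), ("LEI", "LEI"), ("Stock_ticker", "ticker")):
            lines.append(f"{uri} <{EX}{prop}> {literal(row[column])} .")
    return lines


ntriples = rows_to_ntriples(CSV_TEXT)
```

Each CSV column becomes a variable, the two BIND steps derive the subject IRI and the language-tagged name, and the template is instantiated once per row.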
  • 62. TARQL, Any23 • Created at Insight (DERI) for converting between popular serialisation formats • Online converter: http://any23-vm.apache.org/
  • 63. Accessing Data - Query Federation. Ali Hasnain, Narumol Prangnawarat, Naoise Dunne
  • 64. SPARQL Query Federation Approaches • SPARQL Endpoint Federation (SEF) • Linked Data Federation (LDF) • Distributed Hash Tables (DHTs) • Hybrid of SEF+LDF. Courtesy of Muhammad Saleem (AKSW)
  • 65. SPARQL Query Federation Approaches. Courtesy of Muhammad Saleem (AKSW)
  • 66. SPARQL Endpoint Federation Approaches • Most commonly used approaches • Make use of SPARQL endpoint URLs • Fast query execution • RDF data needs to be exposed via SPARQL endpoints • E.g., HiBISCuS, FedX, SPLENDID, ANAPSID, LHD etc. Courtesy of Muhammad Saleem (AKSW)
  • 67. Linked Data Federation Approaches • Data need not be exposed via SPARQL endpoints • Uses URI lookups at runtime • Data should follow Linked Data principles • Slower compared to the previous approaches • E.g., LDQPS, SIHJoin, WoDQA etc. Courtesy of Muhammad Saleem (AKSW)
  • 68. Query federation on top of Distributed Hash Tables • Uses DHT indexing to federate SPARQL queries • Space efficient • Cannot deal with the whole LOD • E.g., ATLAS. Courtesy of Muhammad Saleem (AKSW)
  • 69. Hybrid of SEF+LDF • Federation over SPARQL endpoints and Linked Data • Can potentially deal with the whole LOD • E.g., ADERIS-Hybrid. Courtesy of Muhammad Saleem (AKSW)
  • 70. SPARQL Endpoint Federation [Figure: a federator in front of four RDF sources S1-S4.] Federator stages: 1) Parsing/Rewriting: rewrite the query and get individual triple patterns; 2) Source Selection: identify capable sources for the individual triple patterns; 3) Optimizer: generate an optimized sub-query execution plan; 4) execute the sub-queries; 5) Integrator: integrate the sub-query results. Courtesy of Muhammad Saleem (AKSW)
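The final integration stage can be pictured as a join over the variable bindings that the sub-queries return. The sketch below uses a simple hash join. It is not taken from any of the engines named in these slides, the function name `join_bindings` is made up, the sample bindings are invented, and it assumes that every row of one result set binds the same variables.

```python
def join_bindings(left, right):
    """Join two lists of variable bindings on the variables they share."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    index = {}
    for row in right:                      # build phase of a hash join
        index.setdefault(tuple(row[v] for v in shared), []).append(row)
    joined = []
    for row in left:                       # probe phase
        for match in index.get(tuple(row[v] for v in shared), []):
            joined.append({**row, **match})
    return joined


# Made-up sub-query results from two endpoints.
from_s1 = [{"?president": "dbpedia:A", "?party": "dbpedia:PartyX"},
           {"?president": "dbpedia:B", "?party": "dbpedia:PartyY"}]
from_s4 = [{"?president": "dbpedia:A", "?page": "nyt:page-a"}]
results = join_bindings(from_s1, from_s4)
```

Only the president that appears in both result sets survives the join, which is why good source selection and small intermediate results matter for federated query performance.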
  • 71. FedBench (LD3): Return for all US presidents their party membership and news pages about them. SELECT ?president ?party ?page WHERE { ?president rdf:type dbpedia:President . ?president dbpedia:nationality dbpedia:United_States . ?president dbpedia:party ?party . ?x nyt:topicPage ?page . ?x owl:sameAs ?president . } (triple patterns TP1 to TP5, in order). [Figure: Source Selection Algorithm over nine RDF sources S1-S9: DBpedia, KEGG, ChEBI, NYT, SWDF, LMDB, Jamendo, GeoNames, DrugBank.] Triple pattern-wise source selection: TP1 = S1, TP2 = S1. Courtesy of Muhammad Saleem (AKSW)
  • 72. FedBench (LD3), source selection continued (same query and sources as slide 71): TP1 = S1, TP2 = S1, TP3 = S1. Courtesy of Muhammad Saleem (AKSW)
  • 73. FedBench (LD3), source selection continued (same query and sources as slide 71): TP1 = S1, TP2 = S1, TP3 = S1, TP4 = S4. Courtesy of Muhammad Saleem (AKSW)
  • 74. FedBench (LD3), source selection continued (same query and sources as slide 71): TP1 = S1, TP2 = S1, TP3 = S1, TP4 = S4, TP5 = S1, S2, S4-S9. Total triple pattern-wise sources selected = 1+1+1+1+8 = 12. Courtesy of Muhammad Saleem (AKSW)
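Triple pattern-wise source selection can be sketched as asking every source, for every triple pattern, whether it holds at least one matching triple (comparable to sending a SPARQL ASK query). The Python below is a toy model and not the algorithm of any specific engine: the sources are in-memory lists, and their contents are made up solely so that the outcome reproduces the selection shown on the slides.

```python
def matches(pattern, triple):
    """A triple pattern matches when every non-variable term is equal."""
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))


def select_sources(patterns, sources):
    """For each triple pattern, keep the sources holding at least one match."""
    return {
        name: [sid for sid, triples in sources.items()
               if any(matches(pattern, t) for t in triples)]
        for name, pattern in patterns.items()
    }


PATTERNS = {
    "TP1": ("?president", "rdf:type", "dbpedia:President"),
    "TP2": ("?president", "dbpedia:nationality", "dbpedia:United_States"),
    "TP3": ("?president", "dbpedia:party", "?party"),
    "TP4": ("?x", "nyt:topicPage", "?page"),
    "TP5": ("?x", "owl:sameAs", "?president"),
}

# Made-up sample triples, chosen only to reproduce the selection on the slides.
SAME_AS = ("ex:a", "owl:sameAs", "ex:b")
SOURCES = {f"S{i}": [SAME_AS] for i in (2, 5, 6, 7, 8, 9)}
SOURCES["S3"] = [("ex:c", "rdfs:label", "no sameAs links here")]
SOURCES["S1"] = [SAME_AS,
                 ("dbpedia:P1", "rdf:type", "dbpedia:President"),
                 ("dbpedia:P1", "dbpedia:nationality", "dbpedia:United_States"),
                 ("dbpedia:P1", "dbpedia:party", "dbpedia:PartyX")]
SOURCES["S4"] = [SAME_AS, ("nyt:t1", "nyt:topicPage", "nyt:page-1")]

selected = select_sources(PATTERNS, SOURCES)
total = sum(len(ids) for ids in selected.values())
```

The generic `owl:sameAs` pattern is answerable by almost every source, which is where most of the 12 selected sources come from; more selective approaches try to prune such sources before any sub-query is sent.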
  • 75. © Insight 2014. All Rights Reserved SPARQL Query Federation Engine •FedX •SPLENDID •HiBISCuS+FedX •HiBISCuS+SPLENDID •ANAPSID •BioFed •LHD •DARQ Curtsey Muhammad Saleem (AKSW)
  • 76. © Insight 2014. All Rights Reserved Overview of Implementation details of Federated Sparql Query Engines
  • 77. © Insight 2014. All Rights Reserved System Features of Federated Sparql Query Engines
  • 78. © Insight 2014. All Rights Reserved System’s Support for SPARQL Query Construct
  • 79. Analysis of Linked Data at Scale Narumol Prangnawarat, Ali Hasnain, Naoise Dunne
  • 80. Linked Data A method of publishing structured data so that it can be interlinked. • Normally represented as RDF • Normally queried using SPARQL 4 principles of Linked Data 1. Use URIs to name (identify) things. 2. Use HTTP URIs so that they can be looked up. 3. Provide metadata about what the thing is, using open standards (RDF, SPARQL, etc.). 4. Link to other things using their HTTP URI-based names.
  • 81. How do I query Linked Data?
  • 82. Linked Data as a graph Many approaches to querying and reasoning over Linked Data exist. The most popular query language in the RDF community is SPARQL, but alternatives exist if we think about Linked Data as a graph.
  • 83. Linked Data as a graph As Linked Data graphs share graph structure and can be connected, we can reason over them using graph algorithms. Popular approaches to querying graphs are: • Declarative pattern matching - SPARQL • Graph traversal languages - Cypher, Gremlin • Distributed graph data structures - GraphX
  • 84. Comparing Graph Query Approaches
  • 85. Comparing Graph Query Approaches
      Cypher (graph traversal): somewhere between graph traversal and pattern matching; runs on Neo4j.
        Characteristics: property graph; index-free adjacency; uses adjacency tables.
        Pros: great at localised searches (shortest path, for instance); is a database; is an expressive language.
        Cons: poor at aggregation.
      GraphX (message passing): a DSL rather than a language, aimed at "programmers"; Scala, Java and Python APIs.
        Characteristics: Resilient Distributed Property Graph; index-free adjacency; vertex/edge tables.
        Pros: best distributed execution in this list; powerful mix of table and traversal; has a set of optimised operators.
        Cons: not a database, so data needs to be loaded and stored separately.
      SPARQL (declarative): a declarative language, allowing better expression and abstracting the writer from optimisation problems.
        Characteristics: stores triples (tuples); vendors normally optimise popular queries; often built on an RDBMS.
        Pros: is an expressive language; easy to express search patterns.
        Cons: difficult to scale; poor at traversal.
  • 86. Comparing Graph Query Approaches
      Cypher (graph traversal): somewhere between graph traversal and pattern matching; runs on Neo4j.
      GraphX (message passing): a DSL rather than a language; Scala, Java and Python APIs.
      SPARQL (declarative): a declarative language, allowing better expression and abstracting the writer from optimisation problems.
      Axes of comparison: abstract vs. concrete; considerable work (for vendors) to scale vs. little work to scale; optimised for aggregation vs. optimised for connections; global vs. local.
  • 87. Other Graph Query Languages
      Cypher (graph traversal): somewhere between graph traversal and pattern matching; runs on Neo4j.
        Alternatives: Gremlin, a "pure" graph traversal language based on XPath; NetworkX, a Python DSL for graphs.
      GraphX (message passing): a DSL rather than a language, aimed at "programmers"; Scala, Java and Python APIs.
        Alternatives: GraphLab, a very powerful commercial product; Giraph, a Hadoop API for graphs.
      SPARQL (declarative): a declarative language, allowing better expression and abstracting the writer from optimisation problems.
        Alternatives: Cypher(!), which has pattern matching constructs and can do much the same as SPARQL on a Neo4j database.
  • 88. SPARQL at big data scale What about using SPARQL in a distributed big data infrastructure?
  • 89. SPARQL at big data scale At big data scales and in distributed infrastructure, SPARQL quickly becomes an impediment. Why? • It is difficult to optimise SPARQL at scale with fast data. • As optimisations are embedded in the query, optimised SPARQL queries become less natural and hard to write. • SPARQL abstractions "leak", leading to "hacks" of big data RDF infrastructure.
  • 90. Why use graph algorithms Why not SPARQL • For distributed computing, declarative languages such as SPARQL are (for now) problematic. • You cannot know whether a query is NP or EXP time. • It is difficult to create query plans, especially with fast or changing graphs. • You have to rely on federation, which is still being researched. Why graph traversal • Proven to scale. • Graph traversal is lower level but easier to tune so that it runs in P time. • The most popular algorithms are local (that is, they only query neighbouring nodes at any time) and are thus easier to break up across compute nodes.
  • 91. Scaling SPARQL So you still want to use SPARQL at scale. How do we query RDF data with SPARQL at big data scale? Clustering • Some vendors/platforms provide clustered triple stores. Federation (became available in SPARQL 1.1 with the SERVICE keyword) • Federation emerged with great promises. • For distributed computing, SPARQL federation has limitations (for now). • Query planning for SPARQL is NP-complete.
  • 92. Alternatives to scaling SPARQL Fortunately, both graph traversal and GraphX offer approaches to Linked Data that work at scale today. We will look at these approaches now.
  • 93. Analysing Linked Data as a Graph Querying Linked Data at scale with technologies complementary to SPARQL
  • 94. Linked Data as a graph Linked Data can be considered a heterogeneous graph. What do we mean by a heterogeneous graph? • Linked Data graphs share graph structure but have mixed characteristics. • Nodes (vertices) contain different data. • Links (edges) are a mix of directed and undirected. • A Linked Data graph can be weighted or not. • It can have a mix of classes and types from differing ontologies. What do these graphs look like?
  • 95. Graphs within a heterogeneous graph
  • 96. Linked Data as a graph As Linked Data graphs share graph structure and can be connected, we can reason over them using graph algorithms. Popular approaches are: • Graph traversal languages - Cypher, Gremlin • Distributed graph data structures - GraphX • Declarative pattern matching - SPARQL, but this can be hard to scale
  • 97. Cypher graph language Some basic Cypher...
      Find the actor named "Tom Hanks":
        MATCH (tom {name: "Tom Hanks"}) RETURN tom
      Who directed "Cloud Atlas"?
        MATCH (cloudAtlas {title: "Cloud Atlas"})<-[:DIRECTED]-(directors) RETURN directors.name
  • 98. Shortest Path Shortest path is the problem of finding a path between two vertices (or nodes) in a graph with the lowest total weight (this could be cost or distance).
  • 99. What is Dijkstra? An approach to the shortest-path problem, using traversal, that reduces the cost compared with a naive search. A naive shortest path between 2 points has a complexity of O(|V|³); with Dijkstra's approach we get a worst-case complexity of O(|E| + |V| log |V|).
  • 100. Dijkstra using Cypher
        MATCH p = shortestPath( (bacon:Person {name:"Kevin Bacon"})-[*]-(meg:Person {name:"Meg Ryan"}) ) RETURN p
      Finds the shortest path between the Person named Kevin Bacon and the Person named Meg Ryan. DEMO
  • 101. Cypher shortest path
  • 102. Another useful algorithm: community detection What is community detection? A graph is said to have community structure if the nodes of the network can be easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is densely connected internally.
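One common way to make "densely connected internally" measurable is modularity, the score that Louvain-style methods try to maximise. The sketch below is a minimal, assumption-laden illustration: it handles only undirected, unweighted graphs without self-loops and non-overlapping communities, and the function name and toy graph are invented for this example.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def modularity(edges: List[Edge], community: Dict[Hashable, Hashable]) -> float:
    """Modularity Q = sum over communities of (L_c / m) - (d_c / 2m)^2.

    L_c is the number of edges inside community c, d_c is the sum of the
    degrees of its nodes, and m is the total number of edges.
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both endpoints in the community
    degree = defaultdict(int)    # summed node degrees per community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph (assumed data): two triangles joined by a single bridge edge c-d.
EDGES: List[Edge] = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]

two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(EDGES, two_groups)
q_single = modularity(EDGES, one_group)
print(f"two communities: {q_split:.4f}, one community: {q_single:.4f}")
```

Splitting the toy graph along the bridge scores higher than treating everything as one community, which is the signal a community detection algorithm searches for when it moves nodes between groups.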
  • 103. Visualisation tools for large graphs On larger datasets, Neo4j's UI is too unresponsive for large result sets from queries such as community detection. • Instead you may want to use a visualisation tool such as Gephi. • We will now demonstrate Louvain community detection using the graph visualisation tool Gephi.
  • 104. Using Gephi Here is a graph that we calculated using Louvain earlier... This example shows clustering around topics discussed on Twitter: • The graph clusters tweets around the topic under discussion. • Each retweet or reply creates a link in the graph.
  • 105. Gephi screenshot
  • 106. Using Gephi Here is a graph that we calculated using Louvain earlier... This example shows clustering around topics discussed on Twitter: • We cluster tweets around the topic under discussion. • Each retweet or reply creates a link in the graph. Once the Louvain analysis is complete, we load the results into Gephi and gain new insights by visualising the clusters around these topics: • Compared to text analysis, we are able to detect deeper community structures from the retweets and replies to tweets. • This gives us a deeper understanding of the individuals and communities.
  • 107. SPARQL Federation example Q: How are the protein targets of the Gleevec drug differentially expressed, and which pathways are they involved in?
      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
      PREFIX chembl_molecule: <http://rdf.ebi.ac.uk/resource/chembl/molecule/>
      PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>
      PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
      PREFIX sio: <http://semanticscience.org/resource/>
      PREFIX dcterms: <http://purl.org/dc/terms/>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      SELECT DISTINCT ?dbXref (str(?pathwayname) AS ?pathname) ?factorLabel
      WHERE {
        # query chembl for gleevec (CHEMBL941) protein targets
        ?act a cco:Activity ;
             cco:hasMolecule chembl_molecule:CHEMBL941 ;
             cco:hasAssay ?assay .
        ?assay cco:hasTarget ?target .
        ?target cco:hasTargetComponent ?targetcmpt .
        ?targetcmpt cco:targetCmptXref ?dbXref .
        ?targetcmpt cco:taxonomy .
        ?dbXref a cco:UniprotRef
        # query for pathways by those protein targets
        SERVICE <http://www.ebi.ac.uk/rdf/services/reactome/sparql> {
          ?protein rdf:type biopax3:Protein .
          ?protein biopax3:memberPhysicalEntity [biopax3:entityReference ?dbXref] .
          ?pathway biopax3:displayName ?pathwayname .
          ?pathway biopax3:pathwayComponent ?reaction .
          ?reaction ?rel ?protein .
        }
        # get Atlas experiment plus experimental factor where protein is expressed
        SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
          ?probe atlasterms:dbXref ?dbXref .
          ?value atlasterms:isMeasurementOf ?probe .
          ?value atlasterms:hasFactorValue ?factor .
          ?value rdfs:label ?factorLabel .
        }
      }
  • 108. Why use graph algorithms Polynomial time: many queries on Linked Data can be expressed as well-known graph traversal algorithms that work in P time, such as Dijkstra for directed graphs with positive weights. Because these algorithms are localised, they suit distributed computing.
  • 109. A purer Dijkstra using Cypher
      MATCH (from:Location {LocationName:"x"}), (to:Location {LocationName:"y"}),
            paths = allShortestPaths((from)-[:CONNECTED_TO*]->(to))
      WITH REDUCE(dist = 0, rel IN rels(paths) | dist + rel.distance) AS distance, paths
      RETURN paths, distance ORDER BY distance LIMIT 1
      The other approach used an inbuilt function; this shows a closer approximation of the actual algorithm.
  • 110. Dijkstra Pseudocode
      dist[s] ← 0                                   (distance to source vertex is zero)
      for all v ∈ V − {s}
        do dist[v] ← ∞                              (set all other distances to infinity)
      S ← ∅                                         (S, the set of visited vertices, is initially empty)
      Q ← V                                         (Q, the queue, initially contains all vertices)
      while Q ≠ ∅                                   (while the queue is not empty)
        do u ← mindistance(Q, dist)                 (select the element of Q with the min. distance)
           S ← S ∪ {u}                              (add u to list of visited vertices)
           for all v ∈ neighbors[u]
             do if dist[v] > dist[u] + w(u,v)       (if new shortest path found)
                  then dist[v] ← dist[u] + w(u,v)   (set new value of shortest path)
                  (if desired, add trace back code)
      return dist
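The pseudocode translates fairly directly into Python. The version below is a sketch under a few assumptions: the graph is an adjacency dictionary in which every vertex appears as a key, edge weights are non-negative, and the queue is a binary heap from the standard library rather than the Fibonacci heap needed for the O(|E| + |V| log |V|) bound, so this variant runs in O((|V| + |E|) log |V|). The example graph is invented for illustration.

```python
import heapq
from typing import Dict, Hashable, List, Tuple

Graph = Dict[Hashable, List[Tuple[Hashable, float]]]


def dijkstra(graph: Graph, source: Hashable) -> Dict[Hashable, float]:
    """Return the shortest distance from source to every vertex in graph."""
    # dist[s] <- 0, every other distance <- infinity
    dist = {vertex: float("inf") for vertex in graph}
    dist[source] = 0.0
    visited = set()            # S, the set of visited vertices
    queue = [(0.0, source)]    # Q, ordered by tentative distance

    while queue:
        d, u = heapq.heappop(queue)   # vertex with the minimum distance
        if u in visited:
            continue                  # stale queue entry, already settled
        visited.add(u)
        for v, weight in graph[u]:
            if dist[v] > d + weight:  # a shorter path to v was found
                dist[v] = d + weight
                heapq.heappush(queue, (dist[v], v))
    return dist


# Small directed, positively weighted example graph (assumed data).
EXAMPLE: Graph = {
    "A": [("B", 4), ("C", 1)],
    "B": [("D", 1)],
    "C": [("B", 2), ("D", 5)],
    "D": [],
}

distances = dijkstra(EXAMPLE, "A")
print(distances)
```

Instead of starting with every vertex in the queue, this variant pushes a vertex each time its distance improves and skips stale entries when they are popped, which is the usual way to work around the lack of a decrease-key operation in `heapq`.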
  • 111. Visualization Ali Hasnain, Stefan Decker, Naoise Dunne
  • 112. Visualization • Visualize your data! • Available tools • ReVeaLD • FedViz • Genome Wheel
  • 113. ReVeaLD Search Platform ReVeaLD (Real-Time Visual Explorer and Aggregator of Linked Data) is a user-driven, domain-specific search platform. Intuitively formulate advanced search queries using a click-input-select mechanism. Visualize the results in a domain-suitable format. Assembly of the query is governed by a Domain Specific Language (DSL), which in this case is the Granatum Biomedical Semantic Model (CanCO).
  • 114. ReVeaLD Search Platform Availability: http://n10.soma.insight-centre.org:31005/explorer Demo: https://www.youtube.com/watch?v=6HHK4ASIkJM&hd=1 Courtesy Maulik Kamdar
  • 115. DSL Visual Representation Concept map visualization is used.
  • 116. Visual Query Builder
  • 117. Visual Query Model
  • 118. FedViz: A Visual Interface for SPARQL Query Formulation and Execution FedViz is an online application that provides biologists with a flexible visual interface to formulate and execute both federated and non-federated SPARQL queries. It translates the visually assembled queries into their SPARQL equivalent and executes them using a query engine (FedX). Availability: http://srvgal86.deri.ie/FedViz/index.html Courtesy Sana e Zainab
  • 119. FedViz: A Visual Interface for SPARQL Query Formulation and Execution
  • 120. Using FedViz: Step by Step
  • 121. GenomeSnip Platform A semantic, visual analytics prototype devised to expedite knowledge exploration and discovery in cancer research. Idea: 'Snip' the human genome informatively into fragments through interaction with an aggregative, circular visualization, the 'Genomic Wheel', and introspectively analyze the snipped fragments in a linear 'Genomic Tracks' display. Technologies: Web-based client application developed using native technologies such as HTML5 Canvas, JavaScript and JSON. The KineticJS library, an HTML5 Canvas JavaScript framework, is used for node nesting, layering, caching and event handling. Availability: http://srvgal78.deri.ie/genomeSnip/ Courtesy Maulik Kamdar
  • 122. Genome Browser
  • 123. Hands On Session Narumol Prangnawarat, Ali Hasnain, Naoise Dunne
  • 124. Convert CSV File to RDF using TARQL Instructions Manual: https://goo.gl/xLpF8Y