Discusses some of the challenges around applying semantics at scale (tens of billions of triples and larger) and describes some of the patterns that can be used to meet those challenges.
Challenges and Patterns for Semantics at Scale
Rob Vesse
rvesse@cray.com
@RobVesse
Overview
● Background
● Challenges & Patterns
  ● Obtaining Data
  ● Input Format
  ● Blank Nodes
  ● Graph Partitioning
  ● Benchmarking
Background
● PhD in Computer Science
● Open Source
  ● Apache Jena
  ● dotNetRDF
● Software Engineer at Cray Inc
  ● In Analytics R&D for the last 5 years
● Cray sells a range of analytics products
  ● Cray Graph Engine
    ● Massively scalable parallel RDF database and SPARQL engine
    ● Runs on GX and XC hardware platforms
    ● GX nodes are roughly equivalent to an r3.8xlarge EC2 instance
Background - Terminology
● What do we mean by "at scale"?
  ● Typical customers have 10s of billions of triples
  ● Some are around the 100 billion mark
● What do we mean by parallelism?
  ● On node, i.e. multiple threads/processes
  ● Across nodes, i.e. multiple machines
Challenge #1 - Obtaining Data
● Most data does not start out as RDF
  ● Relational databases, spreadsheets, structured/semi-structured data, flat files, etc.
  ● The exact mix varies depending on the customer domain
● Therefore the first challenge is to get the data into RDF
● Problems
  ● Many ETL tools don't support RDF as an output format
  ● Even when tools do support it, they are not scalable
    ● E.g. D2RQ (http://d2rq.org)
Pattern #1 - Leverage Big Data
● Lots of big data projects can be used to implement ETL pipelines (see the sketch below)
  ● E.g. MapReduce, Spark, Flume, Sqoop
● There are some libraries available that provide basic plumbing for this, e.g.
  ● Apache Jena Elephas
    ● http://jena.apache.org/documentation/hadoop/index.html
● Unfortunately ETL tends to be very customer and data specific
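
As a rough illustration, a minimal Spark job can turn tabular data into NTriples one row at a time. This is a sketch only: the input path, the people.csv layout and the URIs/predicate are all hypothetical, and real ETL logic is invariably far more involved.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CsvToNTriples {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("csv-to-ntriples");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical input: "id,name" rows, no header, no quoted commas
            JavaRDD<String> lines = sc.textFile("hdfs:///input/people.csv");
            // Each CSV row becomes one self-contained NTriples line, so the
            // conversion parallelises trivially across the cluster
            JavaRDD<String> triples = lines.map(line -> {
                String[] cols = line.split(",", 2);
                return String.format(
                        "<http://example.org/person/%s> <http://xmlns.com/foaf/0.1/name> \"%s\" .",
                        cols[0], cols[1].replace("\"", "\\\""));
            });
            triples.saveAsTextFile("hdfs:///output/people-nt");
        }
    }
}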
Challenge #2 - Input Format
● What data format should we be using?
● There are at least four widely used standard serialisations:
  ● NTriples/NQuads, Turtle/TriG, RDF/XML and JSON-LD
● Plus a variety of lesser used formats, e.g. TriX, RDF/JSON, HDT, RDF/Thrift, Sesame Binary RDF, etc.
● Choice of format affects how you process it:
  ● Parallel processing
  ● Error tolerance
  ● State tracking
Pattern #2 - Use NTriples/NQuads
● Simple but effective
● Can be arbitrarily split into chunks
  ● E.g. pick some number of bytes, split into chunks, seek from chunk boundaries to find actual line boundaries, then process line by line (see the sketch below)
● Extremely error tolerant
  ● Every line can be processed independently without needing any shared state
● Even this has challenges:
  ● Verbose format, so large datasets require extremely large files
  ● Blank nodes can still be problematic
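
A minimal sketch of the boundary-aligned splitting described above, assuming a hypothetical processLine() handler; production code would also need to handle character encoding and degenerate chunk sizes.

import java.io.IOException;
import java.io.RandomAccessFile;

public class NTriplesChunkReader {
    // Process the chunk of the file starting at chunkStart, chunkSize bytes
    // long. A line straddling the start boundary belongs to the previous
    // chunk; a line straddling the end boundary belongs to this one.
    static void processChunk(String path, long chunkStart, long chunkSize)
            throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(chunkStart);
            if (chunkStart > 0) {
                file.readLine(); // skip the partial line; the previous chunk owns it
            }
            String line;
            while (file.getFilePointer() < chunkStart + chunkSize
                    && (line = file.readLine()) != null) {
                processLine(line); // hypothetical per-triple handler
            }
        }
    }

    static void processLine(String ntriplesLine) {
        // Parse a single NTriples line here; every line is self-contained
    }
}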
Challenge #3 - Blank Node Identifiers
● Specifications say that a blank node identifier is file scoped
  ● I.e. _:foo in a.nt is a different node from _:foo in b.nt
  ● And _:foo is the same node throughout a.nt
● Need to consistently assign identifiers despite processing the data in chunks on different physical nodes
  ● Preferably without resorting to global state/synchronisation
a.nt:
<urn:a> <urn:link> _:foo .
_:foo <urn:link> <urn:b> .
# Many 100,000s of lines later
<urn:z> <urn:link> _:foo .

b.nt:
_:foo <urn:value> "example" .
_:bar <urn:value> "other" .
Pattern #3 - Derived Blank Node Identifiers
● Derive identifiers from a combination of their local identifier and a scope identifier
  ● E.g. _:foo and a.nt
● Derivation method doesn't matter provided it is:
  ● Scope aware
  ● Deterministic
● Some possibilities (see the sketch below):
  ● One-way hash, e.g. MD5
  ● Mathematical transform
  ● Seeded random number generator (RNG)
● Apache Jena uses a seeded RNG
  ● Scope awareness achieved by seeding the RNG based upon the filename
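
For illustration, a minimal sketch of the one-way hash option, deriving the identifier from the scope (filename) plus the local identifier. Note this shows the hash variant, not Jena's actual seeded-RNG implementation; the "scope#localId" keying is an assumption made for the example.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DerivedBlankNodeIds {
    // Deterministic and scope aware: the same (scope, localId) pair always
    // yields the same identifier, on any worker, with no shared state
    static String deriveId(String scope, String localId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                    (scope + "#" + localId).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // _:foo in a.nt and _:foo in b.nt get different identifiers...
        System.out.println(deriveId("a.nt", "foo"));
        System.out.println(deriveId("b.nt", "foo"));
        // ...while _:foo in a.nt is stable regardless of which chunk or
        // machine processes the line
        System.out.println(deriveId("a.nt", "foo"));
    }
}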
Challenge #4 - Graph Partitioning
● Open problem
  ● NP-hard
● Large graphs are never going to be processable on a single node
  ● Need to partition across multiple nodes
● Partitioning affects both storage and processing of a graph
  ● May need different schemes depending on the desired processing
Pattern #4 - Domain Specific/Avoid It!
● For specific workloads a domain specific partitioning will be best (a generic baseline is sketched below for contrast)
  ● Needs knowledge of the data and workload
  ● E.g. Educating the Planet with Pearson
● If you can, avoid it altogether!
  ● Take advantage of increasingly capable hardware
  ● Large memory sizes, non-volatile memory, RDMA, high speed interconnects, SSDs
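
For contrast with domain specific schemes, the simplest generic baseline is to hash partition by subject. A minimal sketch, with the string triple representation and fixed node count invented for illustration:

public class SubjectHashPartitioner {
    private final int numNodes;

    SubjectHashPartitioner(int numNodes) {
        this.numNodes = numNodes;
    }

    // All triples sharing a subject land on the same node, so subject-centric
    // star patterns need no communication, but traversals that hop from an
    // object to its own subject position usually cross node boundaries,
    // which is exactly why workload knowledge matters
    int partitionFor(String subject) {
        return Math.floorMod(subject.hashCode(), numNodes);
    }
}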
Challenge #5 - Benchmarking
● Many of the classic benchmarks were developed by academics
  ● E.g. LUBM, SP2B
  ● Often aren't representative of actual customer problems
● Many data generators are single threaded
  ● Difficult to generate large-scale datasets
Pattern #5 - Change Benchmarks
● Linked Data Benchmark Council (LDBC)
  ● Industry working group that develops standardised benchmarks
  ● Equivalent to the Transaction Processing Performance Council (TPC) in the relational database industry
  ● http://ldbcouncil.org
● Design your own
  ● https://github.com/rvesse/sparql-query-bm
● Improve an existing one
  ● https://github.com/rvesse/lubm-uba
  ● LUBM 8k (~1 billion triples) can be generated in under 7 minutes, a 10x speed up
Questions?
rvesse@cray.com
@RobVesse