Discusses some of the challenges around applying semantics at scale (tens of billions of triples and larger) and describes some of the patterns that can be used to meet those challenges.
Challenges and Patterns for Semantics at Scale
Rob Vesse
rvesse@cray.com
@RobVesse
Overview
● Background
● Challenges & Patterns
  ● Obtaining Data
  ● Input Format
  ● Blank Nodes
  ● Graph Partitioning
  ● Benchmarking
Background
● PhD in Computer Science
● Open Source
  ● Apache Jena
  ● dotNetRDF
● Software Engineer at Cray Inc
  ● In Analytics R&D for the last 5 years
● Cray sells a range of analytics products
  ● Cray Graph Engine
    ● Massively scalable parallel RDF database and SPARQL engine
    ● Runs on GX and XC hardware platforms
    ● GX nodes are roughly equivalent to an r3.8xlarge EC2 instance
Background - Terminology
● What do we mean by "at scale"?
  ● Typical customers have 10s of billions of triples
  ● Some are around the 100 billion mark
● What do we mean by parallelism?
  ● On node, i.e. multiple threads/processes
  ● Across nodes, i.e. multiple machines
Challenge #1 - Obtaining Data
● Most data does not start out as RDF
  ● Relational databases, spreadsheets, structured/semi-structured data, flat files, etc.
  ● The exact mix varies depending on the customer domain
● Therefore the first challenge is to get the data into RDF
● Problems
  ● Many ETL tools don't support RDF as an output format
  ● Even when tools do support it, they are not scalable
    ● E.g. D2RQ (http://d2rq.org)
Pattern #1 - Leverage Big Data
● Lots of big data projects can be used to implement ETL pipelines (see the sketch below)
  ● E.g. MapReduce, Spark, Flume, Sqoop
● There are some libraries available that provide basic plumbing for this, e.g.
  ● Apache Jena Elephas
    ● http://jena.apache.org/documentation/hadoop/index.html
● Unfortunately ETL tends to be very customer and data specific
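
As a rough illustration, a minimal Spark job can turn tabular data into NTriples one row at a time. This is a sketch only: the input path, the people.csv layout and the URIs/predicate are all hypothetical, and real ETL logic is invariably far more involved.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CsvToNTriples {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("csv-to-ntriples");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical input: "id,name" rows, no header, no quoted commas
            JavaRDD<String> lines = sc.textFile("hdfs:///input/people.csv");
            // Each CSV row becomes one self-contained NTriples line, so the
            // conversion parallelises trivially across the cluster
            JavaRDD<String> triples = lines.map(line -> {
                String[] cols = line.split(",", 2);
                return String.format(
                        "<http://example.org/person/%s> <http://xmlns.com/foaf/0.1/name> \"%s\" .",
                        cols[0], cols[1].replace("\"", "\\\""));
            });
            triples.saveAsTextFile("hdfs:///output/people-nt");
        }
    }
}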
Challenge #2 - Input Format
● What data format should we be using?
● There are at least four widely used standard serialisations:
  ● NTriples/NQuads, Turtle/TriG, RDF/XML and JSON-LD
● Plus a variety of lesser used formats, e.g. TriX, RDF/JSON, HDT, RDF/Thrift, Sesame Binary RDF, etc.
● Choice of format affects how you process it:
  ● Parallel processing
  ● Error tolerance
  ● State tracking
Pattern #2 - Use NTriples/NQuads
● Simple but effective
● Can be arbitrarily split into chunks
  ● E.g. pick some number of bytes, split into chunks, seek from chunk boundaries to find actual line boundaries, then process line by line (see the sketch below)
● Extremely error tolerant
  ● Every line can be processed independently without needing any shared state
● Even this has challenges:
  ● Verbose format, so large datasets require extremely large files
  ● Blank nodes can still be problematic
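
A minimal sketch of the boundary-aligned splitting described above, assuming a hypothetical processLine() handler; production code would also need to handle character encoding and degenerate chunk sizes.

import java.io.IOException;
import java.io.RandomAccessFile;

public class NTriplesChunkReader {
    // Process the chunk of the file starting at chunkStart, chunkSize bytes
    // long. A line straddling the start boundary belongs to the previous
    // chunk; a line straddling the end boundary belongs to this one.
    static void processChunk(String path, long chunkStart, long chunkSize)
            throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(chunkStart);
            if (chunkStart > 0) {
                file.readLine(); // skip the partial line; the previous chunk owns it
            }
            String line;
            while (file.getFilePointer() < chunkStart + chunkSize
                    && (line = file.readLine()) != null) {
                processLine(line); // hypothetical per-triple handler
            }
        }
    }

    static void processLine(String ntriplesLine) {
        // Parse a single NTriples line here; every line is self-contained
    }
}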
Challenge #3 - Blank Node Identifiers
● Specifications say that a blank node identifier is file scoped
  ● I.e. _:foo in a.nt is a different node from _:foo in b.nt
  ● And _:foo is the same node throughout a.nt
● Need to consistently assign identifiers despite processing the data in chunks on different physical nodes
  ● Preferably without resorting to global state/synchronisation
a.nt:
<urn:a> <urn:link> _:foo .
_:foo <urn:link> <urn:b> .
# Many 100,000s of lines later
<urn:z> <urn:link> _:foo .

b.nt:
_:foo <urn:value> "example" .
_:bar <urn:value> "other" .
Pattern #3 - Derived Blank Node Identifiers
● Derive identifiers from a combination of their local identifier and a scope identifier
  ● E.g. _:foo and a.nt
● Derivation method doesn't matter provided it is:
  ● Scope aware
  ● Deterministic
● Some possibilities (see the sketch below):
  ● One-way hash, e.g. MD5
  ● Mathematical transform
  ● Seeded random number generator (RNG)
● Apache Jena uses a seeded RNG
  ● Scope awareness achieved by seeding the RNG based upon the filename
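
For illustration, a minimal sketch of the one-way hash option, deriving the identifier from the scope (filename) plus the local identifier. Note this shows the hash variant, not Jena's actual seeded-RNG implementation; the "scope#localId" keying is an assumption made for the example.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DerivedBlankNodeIds {
    // Deterministic and scope aware: the same (scope, localId) pair always
    // yields the same identifier, on any worker, with no shared state
    static String deriveId(String scope, String localId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                    (scope + "#" + localId).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // _:foo in a.nt and _:foo in b.nt get different identifiers...
        System.out.println(deriveId("a.nt", "foo"));
        System.out.println(deriveId("b.nt", "foo"));
        // ...while _:foo in a.nt is stable regardless of which chunk or
        // machine processes the line
        System.out.println(deriveId("a.nt", "foo"));
    }
}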
Challenge #4 - Graph Partitioning
● Open problem
  ● NP-hard
● Large graphs are never going to be processable on a single node
  ● Need to partition across multiple nodes
● Partitioning affects both storage and processing of a graph
  ● May need different schemes depending on the desired processing
Pattern #4 - Domain Specific/Avoid It!
● For specific workloads a domain specific partitioning will be best (a generic baseline is sketched below for contrast)
  ● Needs knowledge of the data and workload
  ● E.g. Educating the Planet with Pearson
● If you can, avoid it altogether!
  ● Take advantage of increasingly capable hardware
  ● Large memory sizes, non-volatile memory, RDMA, high speed interconnects, SSDs
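
For contrast with domain specific schemes, the simplest generic baseline is to hash partition by subject. A minimal sketch, with the string triple representation and fixed node count invented for illustration:

public class SubjectHashPartitioner {
    private final int numNodes;

    SubjectHashPartitioner(int numNodes) {
        this.numNodes = numNodes;
    }

    // All triples sharing a subject land on the same node, so subject-centric
    // star patterns need no communication, but traversals that hop from an
    // object to its own subject position usually cross node boundaries,
    // which is exactly why workload knowledge matters
    int partitionFor(String subject) {
        return Math.floorMod(subject.hashCode(), numNodes);
    }
}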
Challenge #5 - Benchmarking
● Many of the classic benchmarks were developed by academics
  ● E.g. LUBM, SP2B
  ● Often aren't representative of actual customer problems
● Many data generators are single threaded
  ● Difficult to generate large-scale datasets
Pattern #5 - Change Benchmarks
● Linked Data Benchmark Council (LDBC)
  ● Industry working group that develops standardised benchmarks
  ● Equivalent to the Transaction Processing Performance Council (TPC) in the relational database industry
  ● http://ldbcouncil.org
● Design your own
  ● https://github.com/rvesse/sparql-query-bm
● Improve an existing one
  ● https://github.com/rvesse/lubm-uba
  ● LUBM 8k (~1 billion triples) can be generated in under 7 minutes, a 10x speed up
Questions?
rvesse@cray.com
@RobVesse