Huge RDF datasets are currently exchanged in textual RDF formats, so consumers must post-process them with RDF stores before local consumption, for example for indexing and SPARQL querying. This is a painful task that demands great effort in terms of time and computational resources. A first approach to lightweight data exchange is a compact (binary) RDF serialization format called HDT. In this paper, we show how to enhance the exchanged HDT with additional structures to support some basic forms of SPARQL query resolution without the need of "unpacking" the data. Experiments show that i) exchange efficiency outperforms universal compression, ii) post-processing becomes a fast process, and iii) query performance at consumption is competitive.
1. Digital Enterprise Research Institute www.deri.ie
Exchange and Consumption of
Huge RDF Data
Miguel A. Martínez-Prieto1,2 <migumar2@infor.uva.es>
Mario Arias1,3 <mario.arias@deri.org>
Javier D. Fernández1,2 <jfergar@infor.uva.es>
1. Department of Computer Science, Universidad de Valladolid (Spain)
2. Department of Computer Science, Universidad de Chile (Chile)
3. Digital Enterprise Research Institute, National University of Ireland Galway
Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
2. Sharing RDF in the Web of Data.
How RDF is shared: dereferenceable URIs, SPARQL endpoints / APIs, RDF dumps.
What consumers do with the data:
• Parsing / indexing.
• Reasoning.
• Dataset analysis.
• Setting up a SPARQL server.
• Vocabulary interlinking / integration.
• Browsing and visualization.
• Exchange between servers.
• Data-intensive tasks.
3. Dataset Exchange Workflow
1º Publication: Convert, Serialize, Compress.
2º Exchange: Transfer.
3º Consumption: Decompress, Parse, Index.
If RDF is meant to be machine-processable, why are we using plain-text serialization formats?
4. HDT: RDF Binary Format
Compact Data Structure for RDF.
W3C Submission. http://www.w3.org/Submission/2011/03/
Open Source C++/Java library.
5. HDT Focused on Querying
FoQ
Contribution of this paper:
A complementary index that makes HDT fully queryable.
An analysis of how HDT reduces exchange and indexing time.
An evaluation of in-memory query performance.
6. Dictionary
Mapping of strings to consecutive IDs {1..n}.
Lexicographically sorted, no duplicates.
Section compression is explained in [8].
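The mapping described above can be sketched in a few lines. This is only an illustration, assuming a single flat dictionary without the sections or compression mentioned on the slide; the class name `Dictionary`, its method names, the `ex:` terms, and the convention that 0 means "not found" are all hypothetical.

```python
import bisect


class Dictionary:
    """Sorted, duplicate-free string dictionary with IDs 1..n (illustrative only)."""

    def __init__(self, terms):
        # Lexicographic order, duplicates removed: the ID is the rank in this list.
        self.terms = sorted(set(terms))

    def string_to_id(self, term):
        # Binary search over the sorted terms; 0 is an assumed "not found" value.
        i = bisect.bisect_left(self.terms, term)
        if i < len(self.terms) and self.terms[i] == term:
            return i + 1
        return 0

    def id_to_string(self, ident):
        # IDs are 1-based, Python lists are 0-based.
        return self.terms[ident - 1]


# Hypothetical terms, with one duplicate on purpose.
d = Dictionary(["ex:bob", "ex:alice", "ex:knows", "ex:alice"])
```

Because the list is sorted, both directions of the lookup need no structure beyond the term list itself.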
7. Triples Model
Triples (S P O): 1 2 6, 1 3 2, 2 1 3, 2 2 4, 2 2 5, 2 4 1, 3 3 2
Tree, level by level:
S   1          2                3
P   [2 3]      [1 2 4]          [3]
O   [6] [2]    [3] [4 5] [1]    [2]
8. Adjacency Lists
List      1        2            3
Values    [2, 3]   [1, 2, 4]    [3]
Position  1 2 3 4 5 6
Array     2 3 1 2 4 3
Bitmap    1 0 1 0 0 1
Operations:
– access(g) = Given a global position, get the value. O(1)
– findList(g) = Given a global position, get the list number. O(1)
– first(l), last(l) = Given a list, find its first and last positions. O(log log n)
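A minimal sketch of these operations on the example above, assuming (as the example bitmap suggests) that a 1 bit marks the first element of each list. The class and method names are hypothetical, and rank and select are done by plain scans here, so the costs quoted on the slide do not apply to this sketch.

```python
class AdjacencyList:
    """Concatenated lists plus a bitmap; all positions and list numbers are 1-based."""

    def __init__(self, array, bitmap):
        self.array = array
        self.bitmap = bitmap

    def access(self, g):
        # Value stored at global position g.
        return self.array[g - 1]

    def find_list(self, g):
        # Number of list starts up to and including g (a rank on the bitmap).
        return sum(self.bitmap[:g])

    def first(self, l):
        # Position of the l-th 1 bit (a select on the bitmap).
        seen = 0
        for pos, bit in enumerate(self.bitmap, start=1):
            if bit:
                seen += 1
                if seen == l:
                    return pos
        raise IndexError("no such list")

    def last(self, l):
        # The last position of list l is just before the start of list l + 1.
        if l == sum(self.bitmap):
            return len(self.array)
        return self.first(l + 1) - 1


y = AdjacencyList([2, 3, 1, 2, 4, 3], [1, 0, 1, 0, 0, 1])
```

With the slide's data, list 2 spans positions 3 to 5 and position 4 holds the value 2.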
9. Triples Model and Coding
Triples (S P O): 1 2 6, 1 3 2, 2 1 3, 2 2 4, 2 2 5, 2 4 1, 3 3 2
S         1 2 3
P         2 3 1 2 4 3
O         6 2 3 4 5 1 2
Array Y   2 3 1 2 4 3
Bitmap Y  1 0 1 0 0 1
Array Z   6 2 3 4 5 1 2
Bitmap Z  1 1 1 1 0 1 1
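The coding above can be reproduced with a short routine. This is a sketch under two assumptions: subject IDs are consecutive from 1 (so the S level stays implicit), and a 1 bit marks the first element of each list, as in the example. The function name `encode_triples` is hypothetical.

```python
def encode_triples(triples):
    """Build Array Y / Bitmap Y (predicates) and Array Z / Bitmap Z (objects)."""
    array_y, bitmap_y, array_z, bitmap_z = [], [], [], []
    prev_s = prev_p = None
    for s, p, o in sorted(set(triples)):  # sort by S, then P, then O
        new_subject = s != prev_s
        new_pair = new_subject or p != prev_p
        if new_pair:
            # One Y entry per distinct (S, P) pair; the bit flags a new subject.
            array_y.append(p)
            bitmap_y.append(1 if new_subject else 0)
        # One Z entry per triple; the bit flags a new (S, P) pair.
        array_z.append(o)
        bitmap_z.append(1 if new_pair else 0)
        prev_s, prev_p = s, p
    return array_y, bitmap_y, array_z, bitmap_z


TRIPLES = [(1, 2, 6), (1, 3, 2), (2, 1, 3), (2, 2, 4), (2, 2, 5), (2, 4, 1), (3, 3, 2)]
array_y, bitmap_y, array_z, bitmap_z = encode_triples(TRIPLES)
```

Run on the seven example triples, it yields the four sequences shown on the slide.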
10. Searching by Subject
Example pattern: ( 2, 2, ? )
Triples (S P O): 1 2 6, 1 3 2, 2 1 3, 2 2 4, 2 2 5, 2 4 1, 3 3 2
S         1 2 3
P         2 3 1 2 4 3
O         6 2 3 4 5 1 2
Array Y   2 3 1 2 4 3
Bitmap Y  1 0 1 0 0 1
Array Z   6 2 3 4 5 1 2
Bitmap Z  1 1 1 1 0 1 1
Patterns resolved by subject: SPO, SP?, S??, S?O
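As a rough illustration of the SP? case, the sketch below resolves ( 2, 2, ? ) on the example arrays. It assumes a 1 bit marks the start of each list and that predicates are sorted within a subject's list, which permits a binary search. The helper names `list_bounds` and `search_sp` are hypothetical, and the list bounds are found by scanning rather than by a constant-time select.

```python
from bisect import bisect_left

ARRAY_Y = [2, 3, 1, 2, 4, 3]
BITMAP_Y = [1, 0, 1, 0, 0, 1]
ARRAY_Z = [6, 2, 3, 4, 5, 1, 2]
BITMAP_Z = [1, 1, 1, 1, 0, 1, 1]


def list_bounds(bitmap, l):
    """1-based (first, last) positions of list l in a start-marked bitmap."""
    starts = [pos for pos, bit in enumerate(bitmap, start=1) if bit]
    first = starts[l - 1]
    last = starts[l] - 1 if l < len(starts) else len(bitmap)
    return first, last


def search_sp(s, p):
    """All triples (s, p, ?) for an existing subject s."""
    y_first, y_last = list_bounds(BITMAP_Y, s)
    # Binary search for p inside the predicate list of subject s.
    i = bisect_left(ARRAY_Y, p, y_first - 1, y_last)
    if i >= y_last or ARRAY_Y[i] != p:
        return []
    y_pos = i + 1
    # The y_pos-th (S, P) pair owns the y_pos-th object list in Z.
    z_first, z_last = list_bounds(BITMAP_Z, y_pos)
    return [(s, p, o) for o in ARRAY_Z[z_first - 1:z_last]]


result = search_sp(2, 2)
```

For the slide's pattern, `result` holds the two triples of subject 2 with predicate 2.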
11. Searching by Predicate
Example pattern: ( ?, 2, ? )
Triples (S P O): 1 2 6, 1 3 2, 2 1 3, 2 2 4, 2 2 5, 2 4 1, 3 3 2
S         1 2 3
P         2 3 1 2 4 3
O         6 2 3 4 5 1 2
Array Y   2 3 1 2 4 3
Bitmap Y  1 0 1 0 0 1
Array Z   6 2 3 4 5 1 2
Bitmap Z  1 1 1 1 0 1 1
Pattern: ?P?
12. Wavelet Tree
Compact sequence of integers in {0..σ}.
[Figure: example sequence of 16 integers, positions 1 to 16, illustrating rank(3, 7) = 2 and select(6, 3) = 9.]
Operations:
– access(position) = Value at position. O(log σ)
– rank(entry, position) = Number of appearances of “entry” up to “position”. O(log σ)
– select(entry, i) = Position where “entry” appears for the i-th time. O(log σ)
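The three operations can be illustrated with a small pointer-based wavelet tree. This is a teaching sketch, not the compact structure used by HDT: bitmaps are plain Python lists, rank on a bitmap uses precomputed prefix counts, and select on a bitmap is a linear scan, so only the descent through O(log σ) levels mirrors the real structure. The class name and the example sequence `SEQ` are hypothetical; the sequence was chosen so that it reproduces the two results quoted on the slide.

```python
class WaveletTree:
    """Wavelet tree over integers in [lo, hi]; positions are 1-based."""

    def __init__(self, seq, lo=None, hi=None):
        # The top-level call needs a non-empty sequence to infer lo and hi.
        self.lo = min(seq) if lo is None else lo
        self.hi = max(seq) if hi is None else hi
        self.n = len(seq)
        self.bits, self.ones = [], [0]
        self.left = self.right = None
        if self.lo == self.hi or not seq:
            return  # leaf, or empty subtree
        mid = (self.lo + self.hi) // 2
        self.bits = [1 if v > mid else 0 for v in seq]
        for b in self.bits:  # ones[i] = number of 1 bits among the first i bits
            self.ones.append(self.ones[-1] + b)
        self.left = WaveletTree([v for v in seq if v <= mid], self.lo, mid)
        self.right = WaveletTree([v for v in seq if v > mid], mid + 1, self.hi)

    def access(self, pos):
        node = self
        while node.lo != node.hi:
            if node.bits[pos - 1]:
                pos, node = node.ones[pos], node.right
            else:
                pos, node = pos - node.ones[pos], node.left
        return node.lo

    def rank(self, value, pos):
        """Occurrences of value among the first pos positions."""
        if not self.lo <= value <= self.hi:
            return 0
        node = self
        while node.lo != node.hi and pos > 0:
            if value > (node.lo + node.hi) // 2:
                pos, node = node.ones[pos], node.right
            else:
                pos, node = pos - node.ones[pos], node.left
        return pos

    def select(self, value, i):
        """Position of the i-th occurrence of value, or 0 if there is none."""
        if not self.lo <= value <= self.hi or i < 1 or i > self.n:
            return 0
        if self.lo == self.hi:
            return i
        bit = 1 if value > (self.lo + self.hi) // 2 else 0
        j = (self.right if bit else self.left).select(value, i)
        if j == 0:
            return 0
        count = 0  # map child position j back: find the j-th `bit` in this node
        for pos, b in enumerate(self.bits, start=1):
            if b == bit:
                count += 1
                if count == j:
                    return pos
        return 0


# Hypothetical 16-element sequence consistent with the slide's two examples.
SEQ = [2, 3, 6, 3, 6, 1, 2, 1, 6, 2, 5, 2, 4, 1, 4, 2]
wt = WaveletTree(SEQ)
```

On this sequence, `wt.rank(3, 7)` counts the two 3s in the first seven positions and `wt.select(6, 3)` locates the third 6.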
13. Searching by Predicate w/ Wavelet
Example pattern: ( ?, 2, ? )
Triples (S P O): 1 2 6, 1 3 2, 2 1 3, 2 2 4, 2 2 5, 2 4 1, 3 3 2
S          1 2 3
P          2 3 1 2 4 3
O          6 2 3 4 5 1 2
Wavelet Y  2 3 1 2 4 3
Bitmap Y   1 0 1 0 0 1
Array Z    6 2 3 4 5 1 2
Bitmap Z   1 1 1 1 0 1 1
Pattern: ?P?
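One way to picture the ?P? resolution: each select on Y jumps straight to the next occurrence of the predicate, a rank on Bitmap Y gives the subject, and the matching list in Z gives the objects. In the sketch below a list scan stands in for the wavelet tree's select, and all helper names are hypothetical.

```python
ARRAY_Y = [2, 3, 1, 2, 4, 3]   # plays the role of Wavelet Y
BITMAP_Y = [1, 0, 1, 0, 0, 1]
ARRAY_Z = [6, 2, 3, 4, 5, 1, 2]
BITMAP_Z = [1, 1, 1, 1, 0, 1, 1]


def select(seq, value, i):
    """1-based position of the i-th occurrence of value, or 0 (stand-in for the wavelet tree)."""
    count = 0
    for pos, v in enumerate(seq, start=1):
        if v == value:
            count += 1
            if count == i:
                return pos
    return 0


def rank1(bitmap, pos):
    """Number of 1 bits among the first pos positions."""
    return sum(bitmap[:pos])


def list_bounds(bitmap, l):
    """1-based (first, last) positions of list l in a start-marked bitmap."""
    starts = [pos for pos, bit in enumerate(bitmap, start=1) if bit]
    first = starts[l - 1]
    last = starts[l] - 1 if l < len(starts) else len(bitmap)
    return first, last


def search_p(p):
    """All triples (?, p, ?)."""
    results, i = [], 1
    while True:
        y_pos = select(ARRAY_Y, p, i)
        if y_pos == 0:
            return results
        s = rank1(BITMAP_Y, y_pos)                  # subject owning this Y position
        z_first, z_last = list_bounds(BITMAP_Z, y_pos)
        results.extend((s, p, o) for o in ARRAY_Z[z_first - 1:z_last])
        i += 1


result = search_p(2)
```

For predicate 2 the occurrences in Y are at positions 1 and 4, which belong to subjects 1 and 2.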
15. Data Structure Summary.
From HDT to HDT-FoQ:
Convert Array Y to Wavelet.
Generate OP-Index.
Triple patterns and the structure that resolves them:
SPO, SP?, S??, S?O    Original HDT
?P?                   Wavelet Tree
?PO, ??O              OP-Index
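The slides do not spell out the layout of the OP-Index, so the following is only a guess at the idea: for every object, remember the positions in Z where it occurs, then climb back to the predicate and subject with ranks on the two bitmaps. The dict-of-lists layout and the names `build_op_index` and `search_o` are assumptions made for illustration.

```python
from collections import defaultdict

ARRAY_Y = [2, 3, 1, 2, 4, 3]
BITMAP_Y = [1, 0, 1, 0, 0, 1]
ARRAY_Z = [6, 2, 3, 4, 5, 1, 2]
BITMAP_Z = [1, 1, 1, 1, 0, 1, 1]


def rank1(bitmap, pos):
    """Number of 1 bits among the first pos positions."""
    return sum(bitmap[:pos])


def build_op_index(array_z):
    """Assumed layout: object ID -> list of 1-based positions in Array Z."""
    index = defaultdict(list)
    for z_pos, o in enumerate(array_z, start=1):
        index[o].append(z_pos)
    return index


OP_INDEX = build_op_index(ARRAY_Z)


def search_o(o, p=None):
    """Triples (?, ?, o), or (?, p, o) when a predicate is given."""
    results = []
    for z_pos in OP_INDEX.get(o, []):
        y_pos = rank1(BITMAP_Z, z_pos)   # which (S, P) pair owns this object
        s = rank1(BITMAP_Y, y_pos)       # which subject owns that pair
        pred = ARRAY_Y[y_pos - 1]
        if p is None or pred == p:
            results.append((s, pred, o))
    return results


result = search_o(2)
```

Filtering by predicate after the lookup keeps the sketch short; an index sorted by predicate inside each object list would allow a binary search instead.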
17. Compression Ratio
[Bar chart: compression ratio (% against plain N-Triples, axis 0 to 14) of hdt, gz, lzma, hdt.gz and hdt.lzma for DBPedia, Geonames, DBLP and LinkedMDB.]
18. Publication Times
            NT+GZIP    NT+LZMA    HDT        HDT+GZIP   HDT+LZMA
linkedMDB   11.3 sec   14.7 min   1.05 min   1.09 min   1.52 min
DBLP        2.72 min   103 min    12 min     13.5 min   21.9 min
Geonames    3.28 min   244 min    25 min     26.4 min   38.9 min
DBPedia     15.9 min   466 min    56 min     60 min     121 min
[Bar chart: times slower than N-Triples + GZIP (axis 0 to 80) for gz, lzma, hdt, hdt.gz and hdt.lzma on dbpedia, geonames, dblp and linkedMDB.]
19. Publication Times (2)
Same publication times as on the previous slide, with the chart rescaled by leaving out NT+LZMA.
[Bar chart: times slower than N-Triples + GZIP (axis 0 to 13) for gz, hdt, hdt.gz and hdt.lzma on dbpedia, geonames, dblp and linkedMDB.]
20. Exchange & Decompression Time
[Stacked bar chart: exchange and decompression time in seconds (geometric mean of all datasets, axis 0 to 300) for GZIP, LZMA, HDT+GZIP and HDT+LZMA.]
*Assuming a network bandwidth of 2 MByte/s.
21. Overall Client Time
[Stacked bar chart: overall client time in seconds (geometric mean of all datasets, axis 0 to 3600), split into Exchange, Decompress and Index, for LZMA+Virtuoso, GZ+Virtuoso, LZMA+RDF3x, GZ+RDF3x, HDT+LZMA+FOQ and HDT+GZIP+FOQ.]
            LZMA+RDF3x   HDT+LZMA
linkedMDB   2.1 min      9.21 sec
dblp        27 min       2.02 min
geonames    49.2 min     3.04 min
dbpedia     159 min      14.3 min
22. In-memory Data Store.
                      Index size (MB)
            Triples   Virtuoso   Hexastore   RDF3x   HDT-FoQ
LinkedMDB   6.1M      518        6976        337     68
DBLP        46M       3982       -           3252    850
Geonames    112M      9216       -           6678    1435
DBPedia     258M      -          -           15802   5260
Smaller size = more data in memory = less I/O access!
25. Conclusions
Data is ready to be consumed 10-15x faster.
Exchange time reduced.
Indexing burden on server = Lightweight client processing.
Competitive query performance.
Very fast on triple patterns.
Joins on the same scale of existing solutions.
This is useful to you:
If you need a fast, compact read-only in-memory RDF store.
If you want to share self-queryable RDF dumps.
If you need fast download & query.
Addresses the volume issue of Big Data.
26. Future work.
Full SPARQL support.
UNION, OPTIONAL, multiple joins.
Optimized query evaluation.
Integration:
Jena, Any23…
Diffusion.
Get more people to use it!
Additional services on top of HDT.
SPARQL Endpoint.
Distributed Stream Processing.
Mobile Applications.
Importance of exchange. The Web is for exchanging data. Data flows between nodes. We are in the “Big Data era”. We need speed, from the network to the application layers.
Role of providers / consumers.
Consumption =~ Querying.
How data is shared:
- Dereferenceable URIs.
- SPARQL endpoints.
- Big datasets: RDF dump (similar to XML, PDF).
Examples where RDF dumps are important:
- Setting up a mirror.
- Overloaded SPARQL server.
- Data analysis.
- Vocabulary integration.
- Download instead of crawl.
- Visualization.
Opens new applications:
- Processing-intensive.
- Cooperating applications.
Triples are sorted component by component. We represent them in a tree:
- Each level represents S, P, O.
- Each path / leaf node represents one triple.
How we encode the tree for (1) space and (2) traversal:
- Level by level. S implicit. P, O arrays.
- Relations with brackets / bitmap.
CPUs are fast, memory/bandwidth are precious.
- Variable-length.
- Compression.
- Compact in-memory representations.