Presentation as held at the "Workshop on Knowledge Evolution and Ontology Dynamics" co-located with ISWC 2011. Related to the paper http://ceur-ws.org/Vol-784/evodyn1.pdf
Statistical Analysis of Web of Data Usage
1. Statistical Analysis of Web of
Data Usage
Towards (Visual) Maintenance
Support for Dataset Publishers
Markus Luczak-Rösch, Markus Bischoff
Freie Universität Berlin, Networked Information Systems (www.ag-nbi.de)
2. Who is addressed?
• rather small/simple ontologies
– min. effort for OE
– “under-engineered”
• unknown user requirements
3. We propose: A Usage-dependent Life Cycle
[Diagram: life cycle with the stages Initial Release, USAGE and Negotiate understanding]
• Initial Release
– RDB2RDF
– Crawling & transformation
– …
• USAGE: Requests and Queries
– SELECT * WHERE ?t a:madeOf a:Plastic
– SELECT * WHERE ?t b:madeOf b:Wood
• Negotiate understanding
– Re-engineering
– Re-population
– …
4. (Very) Quick Example
• Which instruments does the band The Beatles consist of?
• Are the Beatles a “Big Band”?
• What are “British” bands?
5.
6.
• Is it what the user expected to see?
• Did you know that this happens and do you know what to do now?
7. Survey covering approx.
25% of all cloud datasets
• size
• complexity
• engineering methodology
• …
Publishers of most of the datasets do not have any (structured) idea of how to maintain their data.
Survey ran in October 2010, not yet officially published.
8. Role of the dataset publisher (more general)
Effort Distribution between Publisher and Consumer
• use common vocabularies
• provide RDF links to other data resources
• provide schema mappings
[Diagram: Effort Distribution, with the labels "Consumer generates/mines links" and "Publisher provides links as hints"]
Source: Talk of Chris Bizer, "Pay-as-you-go Data Integration" (21/9/2010)
9. Role of the dataset publisher (more specific)*
• Reliability: Is the data valid and complete?
• Peak-load: Temporal profiles of important data?
• Performance: Are caches and indexes optimal?
• Usefulness: What do people find and use frequently?
• Attacks: Is the data threatened by spam?
* w.r.t. Möller et al.: Learning from Linked Open Data Usage: Patterns & Metrics.
11. How do people access resources on the Web of Data?
xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600]
"GET /page/Jeroen_Simaeys HTTP/1.1"
200 26777 "" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600]
"GET /resource/Guano_Apes HTTP/1.1"
303 0 "" "Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:01 -0600]
"GET /sparql?query=PREFIX+rdfs%3A+%..."
200 1844 "" ""
What do they get?
• RDF-Graphs
• SPARQL Query Results XML Format
• …, HTML, JSON, … serialization of results
• …, HTML, JSON, … serialization of no results
HTTP 204 (No Content) would be great for empty results, but for now the usage mining process has to take this into account.
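A minimal sketch of how log lines like the ones above might be split into fields before mining, assuming the Apache combined log format seen in the excerpt. The regular expression, the function name `parse_line` and the `kind` labels are illustrative assumptions, not the tooling behind this talk.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout: host, two unused fields, timestamp, request line,
# status, size, referrer, user agent (Apache combined log format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    url = urlsplit(entry["target"])
    # Hypothetical classification of the three access paths in the excerpt.
    if url.path == "/sparql":
        entry["kind"] = "sparql"
        # parse_qs URL-decodes the query string ('+' becomes a space).
        entry["query"] = parse_qs(url.query).get("query", [""])[0]
    elif url.path.startswith("/resource/"):
        entry["kind"] = "resource"
    elif url.path.startswith("/page/"):
        entry["kind"] = "page"
    else:
        entry["kind"] = "other"
    return entry
```

Keeping the status code next to the decoded query makes it possible to separate answered requests from redirects (303) later, and to inspect the responses that carry an empty result set.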
12. [Diagram: usage mining process, adapted from Myra Spiliopoulou: “Web usage mining for Web site evaluation”, 2000, Commun. ACM]
• Preparation Phase: Log File, Preparation Tool, Prepared Log Data
• Mining Phase: Mining Query, Usage Mining Methods, Mining Results, Visualization Tool, Result Patterns
• Instructions
• Access Methods and Patterns: Queries, Patterns, Triples, Filters, Sessions
• Navigation Patterns and Statistics: Sequences
14. Usage Analysis
• queries
• patterns
• triples
• primitives
[Diagram: ns1:A rdf:type ns2:B]
Reference for details: M. Luczak-Rösch and H. Mühleisen, "Log File Analysis for Web of Data Endpoints," in Proc. of the 8th Extended Semantic Web Conference (ESWC) Poster-Session, 2011.
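A rough sketch of the decomposition listed on this slide (query → triple patterns → primitives). It is deliberately naive and rests on several assumptions: only flat basic graph patterns separated by "." are handled, the `a` shorthand, `;`/`,` abbreviations, FILTER and OPTIONAL are ignored, and the function names are hypothetical. A real analysis would use a proper SPARQL parser.

```python
import re

# Tokens: <IRI>, "literal", ?variable, prefixed:name, or the "." separator.
TOKEN = re.compile(r'<[^>]*>|"[^"]*"|\?\w+|[\w-]*:[\w-]+|\.')


def triple_patterns(query):
    """Return the (subject, predicate, object) patterns of a simple query."""
    start, end = query.find("{"), query.rfind("}")
    if start == -1 or end == -1:
        return []  # no graph pattern block found
    patterns, current = [], []
    for token in TOKEN.findall(query[start + 1:end]):
        if token == ".":
            if len(current) == 3:
                patterns.append(tuple(current))
            current = []
        else:
            current.append(token)
    if len(current) == 3:  # last pattern may lack a trailing "."
        patterns.append(tuple(current))
    return patterns


def primitives(patterns):
    """Return all non-variable terms (classes, predicates, resources)."""
    return {term for pattern in patterns
            for term in pattern if not term.startswith("?")}
```

For a query such as `SELECT * WHERE { ?t rdf:type ns2:B }`, `triple_patterns` would yield one pattern and `primitives` the two terms `rdf:type` and `ns2:B`.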
15. Metrics
• Ontology heat map
– the amount a class or a predicate is used in queries
• Resource usage
– triple combinations in which a resource is used
• Primitive usage
– position in triples
– triple combinations
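The counting behind these metrics could look roughly like the following sketch. It assumes triple patterns are already available as (subject, predicate, object) tuples of strings, with variables starting with "?". Treating only the object of `rdf:type` as a class is a simplifying assumption, and both function names are hypothetical.

```python
from collections import Counter


def heat_map(patterns):
    """Count how often each predicate and each class occurs in the patterns."""
    counts = Counter()
    for subject, predicate, obj in patterns:
        if not predicate.startswith("?"):
            counts[predicate] += 1
        # Assumption: a class is whatever appears as the object of rdf:type.
        if predicate == "rdf:type" and not obj.startswith("?"):
            counts[obj] += 1
    return counts


def primitive_positions(patterns):
    """Count in which triple position each non-variable term is used."""
    positions = Counter()
    for pattern in patterns:
        for role, term in zip(("subject", "predicate", "object"), pattern):
            if not term.startswith("?"):
                positions[(term, role)] += 1
    return positions
```

The counts from `heat_map` could then drive the node weights or colours of an ontology heat map, while `primitive_positions` covers the "position in triples" metric.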
16. Metrics
• Time statistics
– hourly accesses
• Hosts statistics
– hourly accesses per host
– primitives and triple patterns requested by host
• Error statistics
– triple patterns that contradict the schema but succeeded
– triple patterns that fail due to the modelling
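As an illustration of the time and host statistics, the sketch below buckets accesses by hour. It assumes each log entry has been reduced to a (host, timestamp) pair with timestamps in the access-log format `21/Sep/2009:00:00:00 -0600`; the function name and the bucket label format are assumptions.

```python
from collections import Counter
from datetime import datetime


def hourly_accesses(entries):
    """Return (accesses per hour, accesses per host and hour)."""
    per_hour = Counter()
    per_host_hour = Counter()
    for host, timestamp in entries:
        # %b parses English month abbreviations under the default C locale.
        moment = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z")
        # Bucket by the wall-clock hour recorded in the log.
        hour = moment.strftime("%Y-%m-%d %H:00")
        per_hour[hour] += 1
        per_host_hour[(host, hour)] += 1
    return per_hour, per_host_hour
```

The same grouping key could be reused to collect the primitives and triple patterns requested by each host.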
17. Visualizations
• weighted nodes and edges (depending on the applied metric) represent the amount of usage
• network overview
• zoom in and see details
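One possible way to derive such a weighted network from triple patterns is sketched below. The mapping is an assumption, not the visualization tool's actual data model: subjects and objects become nodes, each (subject, predicate, object) pattern becomes an edge, all variables are collapsed into a single placeholder node, and weights are plain occurrence counts.

```python
from collections import Counter


def usage_network(patterns):
    """Return (node weights, edge weights) for a list of triple patterns."""
    node_weight = Counter()
    edge_weight = Counter()
    for subject, predicate, obj in patterns:
        # Collapse differently named variables into one placeholder node.
        s = "?var" if subject.startswith("?") else subject
        o = "?var" if obj.startswith("?") else obj
        node_weight[s] += 1
        node_weight[o] += 1
        edge_weight[(s, predicate, o)] += 1
    return node_weight, edge_weight
```

The two counters can be handed to any graph drawing library, scaling node size and edge width by weight.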
18. Evaluation Dataset
• DBpedia 3.3 log files
– 1,700,000 requests from two randomly chosen days (07/2009)
– analysis against a mirror of the 3.3 dataset (inconsistent dataset)
– performance issues of dynamic network visualization and reprocessing of queries limited the number of analyzed logs
24. Inconsistencies & Weaknesses
• inconsistent data
– ns:Band ns:instrument ?x
– ns:Band ns:genre ?y
– ns:Band ns:associatedBand ?z
• missing facts
– ns:Band ns:knownFor ?x
– ns:Band ns:nationality ?y
• …
Complete analysis can be found at http://page.mi.fu-berlin.de/mluczak/pub/visual-analysis-of-web-of-data-usage-dbpedia33/
25. What to learn from usage analysis?
• ontology maintenance
– schema evolution
– instance population
– ontology modularization
– error detection
Image source http://mrg.bz/GgaxPB
26. What else to learn?
• performance scaling
– index generation
– store architecture based on frequent SPARQL
patterns
– hardware scaling at peak times
– modularization of data for different hosts
27. This is ok for the beginning but…
… SONIVIS can do more
evaluate (with users!) various network visualizations and find the best one for a specific context
28. More for the Future
• Generic patterns for the metrics
+ resolution/evolution patterns
• Common sense of statistics
+ Quality-of-dataset index
• Temporal analysis
• Network metrics (degree,…)
• Visualize the effects of change
Central conclusion: Calculate statistics, weaknesses and inconsistencies first and do visual editing afterwards!
Image source: http://mrg.bz/8Co9lA
29. Take Away
• usage-dependent life cycle support for LOD vocabularies and the populated instances
• (visual) usage analysis can help to plan and perform maintenance activities
• this is a benefit for the dataset publisher and the Web of Data as a whole
Markus Luczak-Rösch (luczak@inf.fu-berlin.de)
Freie Universität Berlin, Networked Information Systems (www.ag-nbi.de)
Image source: http://mrg.bz/jlObbL
Editor's notes
This is not an approach for all kinds of domains, but within LOD we find characteristic ontologies and vocabularies. Dataset hosts do not necessarily know the requirements of the dataset users.
Roughly 25 per cent of all datasets were covered by the survey. That relates to the absolute number of datasets and not the amount of triples served. Some of the bigger ones replied, such as DBpedia and Bio2RDF.