We present a solution that learns URI selection criteria to improve the crawling of Linked Open Data by predicting the RDF-relevance of newly discovered URIs. The prediction component determines whether a newly discovered URI contains RDF content by extracting features from several sources and building a prediction model based on the FTRL-Proximal online learning algorithm. The experimental results demonstrate that the coverage of the crawl is improved compared to baseline methods.
1. Authors: Hai Huang, Fabien Gandon
Wimmics joint research team (Inria, Univ. Côte d’Azur, CNRS, I3S, France)
2. To harvest the enormous Linked Data Web, crawling techniques for Linked Data are important.
A Linked Data crawler aims to crawl as much RDF/RDFa data as possible (maximizing coverage) on the Linked Open Data Web while minimizing unnecessary downloads.
One of the challenges for Linked Data crawlers is to determine the RDF relevance of a URI without downloading it.
RDF-relevant: given a URI u, we call u RDF-relevant if the representation obtained by dereferencing u contains RDF data, and non-RDF-relevant otherwise.
3. Existing work [1,2] uses heuristic rules to identify RDF-relevant URIs:
File extension, such as *.rdf, *.owl, etc.;
HTTP Content-Type (HTTP header), such as application/rdf+xml, text/turtle, etc.
Relying on these heuristic rules hurts the coverage of the crawl (a minimal rule-based filter is sketched after the references below):
It can happen that the server does not provide the desired content type; it is reported that 17% of RDF/XML documents are returned with a content type other than application/rdf+xml.
There are a large number of "hard" URIs whose RDF relevance cannot be determined by these heuristic rules.
[1] Hogan, A., Harth, A., Umbrich, J., Kinsella, S., Polleres, A., Decker, S.: Searching and browsing linked data with SWSE: the semantic web search engine. J. Web Sem. 9(4), 365–401 (2011)
[2] Isele, R., Umbrich, J., Bizer, C., Harth, A.: LDSpider: An open-source crawling framework for the web of linked data. In: Proceedings of the ISWC 2010 Posters & Demonstrations Track (2010)
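A minimal sketch of such rule-based filtering; the extension and media-type lists are illustrative, not the exact sets used in [1,2]:

```python
RDF_EXTENSIONS = (".rdf", ".owl", ".nt", ".ttl", ".n3")
RDF_CONTENT_TYPES = ("application/rdf+xml", "text/turtle",
                     "text/n3", "application/n-triples")

def heuristic_rdf_relevant(uri, content_type=None):
    """Return True/False if a heuristic rule fires, or None for 'hard' URIs."""
    if uri.lower().endswith(RDF_EXTENSIONS):
        return True
    if content_type:
        # Content-Type headers may carry parameters, e.g. "text/turtle; charset=utf-8"
        if content_type.split(";")[0].strip().lower() in RDF_CONTENT_TYPES:
            return True
        return False  # may be wrong: servers sometimes return a misleading content type
    return None  # no rule applies: RDF relevance cannot be determined heuristically
```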
5. Prediction Task. Suppose that a hard URI u_t is discovered in the RDF graph G obtained by dereferencing a referring URI u_r. Our task is to build a learning model to predict whether u_t is RDF-relevant or not.
7. Given a target URI u_t found in the RDF data graph G obtained from the referring URI u_r, the features can be grouped into four sets:
URI information and HTTP header information of the target URI u_t (including host, path, port, etc.), denoted by feature set F+t (see the extraction sketch below):
F+t = { feature_u_scheme=`https',
feature_u_authority=`john.doe@www.example.com:123',
feature_u_host=`www.example.com',
feature_u_userinfo=`john.doe',
feature_u_path=`forum/questions',
feature_u_query=`?tag=networking&order=newest',
feature_u_fragment=`top',
feature_u_Content-Type=`html' }
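A minimal sketch of how such features could be extracted; the helper name target_uri_features is mine, and the Content-Type is taken from whatever HTTP header information happens to be available:

```python
from urllib.parse import urlsplit

def target_uri_features(u, content_type=None):
    """Turn a target URI into 'name=value' string features (sketch)."""
    parts = urlsplit(u)
    feats = {
        "feature_u_scheme": parts.scheme,
        "feature_u_host": parts.hostname or "",
        "feature_u_path": parts.path,
        "feature_u_query": parts.query,
        "feature_u_fragment": parts.fragment,
    }
    if parts.username:
        feats["feature_u_userinfo"] = parts.username
    if content_type:
        feats["feature_u_Content-Type"] = content_type
    # keep only non-empty components as string features
    return [f"{k}={v}" for k, v in feats.items() if v]
```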
8. Feature set F+r: URI information of the referring URI u_r;
Feature set F+p: URI information of the direct properties that connect to URI u_t in RDF graph G;
Feature set F+x: URI information, similarity, and relevance features of the sibling URIs of u_t in RDF graph G (a sibling-extraction sketch follows);
Sibling URIs: URIs that share a common subject-property or property-object pair with u_t in G.
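One possible way to collect sibling URIs with rdflib; the helper name sibling_uris is mine and this is a sketch, not the paper's implementation:

```python
from rdflib import Graph, URIRef

def sibling_uris(g: Graph, u_t: URIRef):
    """Collect URIs that share a subject-property or property-object pair with u_t."""
    siblings = set()
    # u_t appears as an object: other objects of the same (subject, property) pair
    for s, p in g.subject_predicates(object=u_t):
        for o in g.objects(subject=s, predicate=p):
            if isinstance(o, URIRef) and o != u_t:
                siblings.add(o)
    # u_t appears as a subject: other subjects of the same (property, object) pair
    for p, o in g.predicate_objects(subject=u_t):
        for s in g.subjects(predicate=p, object=o):
            if isinstance(s, URIRef) and s != u_t:
                siblings.add(s)
    return siblings
```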
9. Feature hashing is a fast and space-efficient way of vectorizing features. It works by applying a hash function to the features and using their hash values directly as indices into a vector. No dictionary needs to be built in advance.
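A minimal sketch of the hashing trick, assuming the mmh3 package (MurmurHash3 bindings, as mentioned in the notes at the end) and a fixed vector size of 2^20:

```python
import mmh3  # MurmurHash3 bindings; assumed available

def hash_features(features, dim=2**20):
    """Map string features like 'feature_u_host=www.example.com' to
    indices of a sparse binary vector of fixed size dim."""
    indices = set()
    for f in features:
        indices.add(mmh3.hash(f) % dim)  # Python's % always yields a non-negative index
    return sorted(indices)
```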
10. Online learning is a family of machine learning methods for streaming-data settings, i.e., a model is trained over consecutive rounds.
Why an online learning model?
• The environment changes over time. The Linked Data crawler operates in a dynamic environment, so the predictor must be able to update itself during crawling.
• Consecutive observations are not i.i.d. In our scenario, training examples (URIs) become available in sequential order, so the i.i.d. assumption does not hold.
• We develop our online prediction model based on the FTRL-Proximal (Follow The Regularized Leader) learning algorithm, developed at Google for the massive-scale online learning problem of predicting ad click-through rates (CTR).
11. Given a sequence of URIs u_1, u_2, ..., u_t, ..., u_T, we predict their RDF relevance as follows.
At round t+1, the FTRL-Proximal algorithm uses the update formula

w_{t+1} = \arg\min_{w} \left( \mathbf{g}_{1:t} \cdot w + \tfrac{1}{2}\sum_{s=1}^{t} \sigma_s \lVert w - w_s \rVert_2^2 + \lambda_1 \lVert w \rVert_1 \right)

where the first term is the subgradient approximation of the accumulated loss, the second term is a stabilization penalty, and the third term is an L1 penalty. The predicted probability for u_t (a logistic function of w_t \cdot x_t) is compared against a decision value \tau to label u_t as RDF-relevant or not.
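A per-coordinate sketch of FTRL-Proximal with logistic loss, following the standard McMahan et al. formulation; the hyperparameter names and defaults (alpha, beta, l1, l2) are assumptions, not the values used in the paper:

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic loss over hashed binary features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0, dim=2**20):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim   # accumulated adjusted gradients
        self.n = [0.0] * dim   # accumulated squared gradients

    def _weight(self, i):
        z, n = self.z[i], self.n[i]
        if abs(z) <= self.l1:
            return 0.0         # the L1 penalty drives this weight to exactly zero (sparsity)
        sign = -1.0 if z < 0 else 1.0
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, x):
        """x: list of active (hashed) feature indices with implicit value 1."""
        wx = sum(self._weight(i) for i in x)
        wx = max(min(wx, 35.0), -35.0)            # clip to avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(-wx))

    def update(self, x, p, y, sample_weight=1.0):
        """y in {0, 1}; sample_weight carries importance weights such as 1/eps."""
        g = (p - y) * sample_weight               # gradient of logistic loss per active feature
        for i in x:
            w_i = self._weight(i)
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w_i
            self.n[i] += g * g
```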
12. FTRL-Proximal is very efficient and effective at producing sparse models.
A sparse model corresponds to a weight vector with few non-zero entries.
Subsampling.
To observe the true class label of a URI, we have to download it and check whether the representation contains RDF/RDFa data or not.
For the URIs predicted negative, we select only a fraction ε ∈ (0, 1] of them to download and observe their true class label.
Here ε balances online prediction accuracy against network overhead (a subsampling sketch follows).
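A minimal sketch of this ε-subsampling, together with the 1/ε importance weight used to correct the subsampling bias (see the notes at the end); the default value 0.3 for eps is only a placeholder:

```python
import random

def observe_label(uri, predicted_positive, eps=0.3):
    """Decide whether to download a URI to observe its true label (sketch).

    Returns (should_download, importance_weight): predicted positives are
    always downloaded; predicted negatives are downloaded with probability
    eps and, when kept, receive importance weight 1/eps.
    """
    if predicted_positive:
        return True, 1.0
    if random.random() < eps:
        return True, 1.0 / eps
    return False, 0.0
```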
13. The crawler starts from a list of seed URIs and operates a breadth-first crawl.
To avoid issuing too many consecutive HTTP requests to one server, the URIs in the frontier are partitioned into different sets based on their pay-level domains (PLDs). URIs are polled from the PLD sets in a round-robin fashion.
Once a URI u_t is polled, the predictor predicts its class label y_t. If y_t = 1, we retrieve the content of u_t; if y_t = 0, we retrieve the content of u_t with probability ε.
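Putting the pieces together, a sketch of the crawl loop; pld(), extract_features(), fetch() and extract_uris() are assumed helpers (not published code), observe_label() is the sketch above, and the 0.5 threshold stands in for the decision value τ:

```python
from collections import defaultdict, deque

def crawl(seeds, predictor, eps, max_urls=10000):
    """Breadth-first crawl with per-PLD round-robin polling (sketch)."""
    frontier = defaultdict(deque)                 # PLD -> queue of URIs
    for u in seeds:
        frontier[pld(u)].append(u)
    crawled = 0
    while frontier and crawled < max_urls:
        for domain in list(frontier.keys()):      # round-robin over PLD sets
            if not frontier[domain]:
                del frontier[domain]
                continue
            u = frontier[domain].popleft()
            x = extract_features(u)
            p = predictor.predict(x)
            download, weight = observe_label(u, p >= 0.5, eps)
            if not download:
                continue
            content, is_rdf = fetch(u)            # dereference and check for RDF/RDFa
            predictor.update(x, p, int(is_rdf), weight)
            crawled += 1
            if crawled >= max_urls:
                return
            if is_rdf:
                for new_u in extract_uris(content):
                    frontier[pld(new_u)].append(new_u)
```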
15. Aim: We evaluate the predictive ability of different combinations of feature sets.
Dataset. The dataset is generated by operating a breadth-first search (BFS) crawl. It contains 103K URIs from 9,825 different hosts.
Static classifiers. Static classifiers (SVM, KNN, and Naive Bayes) are used to examine the predictive performance of the different combinations of feature sets.
16. Observations:
• The feature combination F+t+r+p (without the sibling feature set) achieved the best performance.
• The features from the target URI u_t are important.
17. We evaluated the performance of the proposed crawler with the online prediction component against several baseline (offline) crawlers.
Methods. We implemented the proposed crawler, denoted LDCOC (Linked Data Crawler with Online Classifier). We also implemented three crawlers that use offline classifiers (SVM, KNN, and Naive Bayes) to select URIs.
Metrics. We use a percentage measure that equals the ratio of retrieved RDF-relevant URIs to the total number of URIs crawled:
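Stated as a formula (the name "coverage rate" is my label for this percentage measure, following the text above):

\text{coverage rate} = \frac{|\{\text{crawled URIs that are RDF-relevant}\}|}{|\{\text{crawled URIs}\}|} \times 100\%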
19. Conclusion
We have presented a solution to learn URI selection criteria in order to improve the crawling of Linked Open Data by predicting the RDF-relevance of URIs.
We built a prediction model based on the FTRL-Proximal learning algorithm and implemented a crawler for Linked Open Data.
Our method can be generalized to crawl other kinds of Linked Data formats such as JSON-LD, Microdata, etc.
Future work
Investigating more features, such as the subgraphs induced by URIs;
Employing a parallelized version of the FTRL-Proximal learning algorithm.
The Web is evolving from a "Web of linked documents" into a "Web of linked data".
Linked Data relies on URIs (Uniform Resource Identifiers) for naming things, RDF for describing data, and SPARQL for querying it.
A large amount of RDF triples is available as linked data in various domains such as health, publications, agriculture, music, etc., in well-known linked datasets such as DBpedia, Wikidata, and Freebase.
In order to harvest this enormous Linked Data Web, crawling techniques for Linked Data are becoming increasingly important.
HTTP headers are mainly intended for the communication between the server and client in both directions.
By hashing the features we gain significant advantages in memory usage, since we can bound the size of the generated binary vectors and do not need a pre-built dictionary. We use MurmurHash3 as the hash function.
In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update our best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once.
Subgradient approximation: the loss of each previous round, f_t(w), is approximated by its subgradient at w_t, i.e., all previous rounds are approximated linearly; using the accumulated subgradients g_{1:t} (rather than a single g_t) gives a more "conservative" estimate and also covers convex functions that are not necessarily differentiable.
Stabilization penalty: we do not want to change the hypothesis too much from the previous iterates (for fear of predicting badly on examples we have already seen); the per-round coefficient acts as a step size that makes each step more conservative.
L1 penalty: λ_1 is a weighting factor; this regularization term is used to induce sparsity.
In the experiments, we set ε as the ratio of the number of positive URIs to the number of negative URIs in the training set.
To deal with the bias of this subsampled data, we assign an importance weight of 1/ε to these training examples.
The experiments are divided into two parts. The first experiment evaluates the predictive ability of different combinations of feature sets: we report the results for the 4-element combination of feature sets (F+t+r+p+x), all 3-element combinations, and the 2-element combinations.
We also found that the performance of F+r+p+x, obtained by excluding feature set F+t from F+t+r+p+x, drops considerably.