We present a solution that learns URI selection criteria to improve the crawling of Linked Open Data by predicting the RDF-relevance of newly discovered URIs. The prediction component determines whether a newly discovered URI contains RDF content by extracting features from several sources and building a prediction model based on the FTRL-Proximal online learning algorithm. The experimental results demonstrate that the coverage of the crawl is improved compared to baseline methods.
1. Authors: Hai Huang, Fabien Gandon
Wimmics joint research team (Inria, Univ. Côte d’Azur, CNRS, I3S, France)
2. To harvest the enormous Linked Data Web, crawling techniques for Linked Data are important.
A Linked Data crawler aims to crawl as much RDF/RDFa data as possible (maximizing coverage) on the Linked Open Data Web while minimizing unnecessary downloads.
One of the challenges for Linked Data crawlers is to determine the RDF relevance of a URI without downloading it.
RDF-relevant: given a URI u, we call u RDF-relevant if the representation obtained by dereferencing u contains RDF data, and non-RDF-relevant otherwise.
3. Existing work [1,2] uses heuristic rules to identify RDF-relevant URIs:
File extension, such as *.rdf, *.owl, etc.;
HTTP Content-Type (HTTP header), such as application/rdf+xml, text/turtle, etc.
Relying on these heuristic rules hurts the coverage of the crawl (a minimal rule-based filter is sketched after the references below):
It can happen that the server does not provide the desired content type; it is reported that 17% of RDF/XML documents are returned with a content type other than application/rdf+xml.
There are a large number of "hard" URIs whose RDF relevance cannot be determined by these heuristic rules.
[1] Hogan, A., Harth, A., Umbrich, J., Kinsella, S., Polleres, A., Decker, S.: Searching and browsing linked data with SWSE: the semantic web search engine. J. Web Sem. 9(4), 365–401 (2011)
[2] Isele, R., Umbrich, J., Bizer, C., Harth, A.: LDSpider: An open-source crawling framework for the web of linked data. In: Proceedings of the ISWC 2010 Posters & Demonstrations Track (2010)
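A minimal sketch of such rule-based filtering; the extension and media-type lists are illustrative, not the exact sets used in [1,2]:

```python
RDF_EXTENSIONS = (".rdf", ".owl", ".nt", ".ttl", ".n3")
RDF_CONTENT_TYPES = ("application/rdf+xml", "text/turtle",
                     "text/n3", "application/n-triples")

def heuristic_rdf_relevant(uri, content_type=None):
    """Return True/False if a heuristic rule fires, or None for 'hard' URIs."""
    if uri.lower().endswith(RDF_EXTENSIONS):
        return True
    if content_type:
        # Content-Type headers may carry parameters, e.g. "text/turtle; charset=utf-8"
        if content_type.split(";")[0].strip().lower() in RDF_CONTENT_TYPES:
            return True
        return False  # may be wrong: servers sometimes return a misleading content type
    return None  # no rule applies: RDF relevance cannot be determined heuristically
```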
5. Prediction Task. Suppose that a hard URI u_t is discovered in the RDF graph G obtained by dereferencing a referring URI u_r. Our task is to build a learning model to predict whether u_t is RDF-relevant or not.
7. Given a target URI u_t found in the RDF data graph G obtained from the referring URI u_r, the features can be grouped into four sets:
URI information and HTTP header information of the target URI u_t (including host, path, port, etc.), denoted by feature set F+t (see the extraction sketch below):
F+t = { feature_u_scheme=`https',
feature_u_authority=`john.doe@www.example.com:123',
feature_u_host=`www.example.com',
feature_u_userinfo=`john.doe',
feature_u_path=`forum/questions',
feature_u_query=`?tag=networking&order=newest',
feature_u_fragment=`top',
feature_u_Content-Type=`html' }
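A minimal sketch of how such features could be extracted; the helper name target_uri_features is mine, and the Content-Type is taken from whatever HTTP header information happens to be available:

```python
from urllib.parse import urlsplit

def target_uri_features(u, content_type=None):
    """Turn a target URI into 'name=value' string features (sketch)."""
    parts = urlsplit(u)
    feats = {
        "feature_u_scheme": parts.scheme,
        "feature_u_host": parts.hostname or "",
        "feature_u_path": parts.path,
        "feature_u_query": parts.query,
        "feature_u_fragment": parts.fragment,
    }
    if parts.username:
        feats["feature_u_userinfo"] = parts.username
    if content_type:
        feats["feature_u_Content-Type"] = content_type
    # keep only non-empty components as string features
    return [f"{k}={v}" for k, v in feats.items() if v]
```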
8. Feature set F+r: URI information of the referring URI u_r;
Feature set F+p: URI information of the direct properties that connect to URI u_t in RDF graph G;
Feature set F+x: URI information, similarity, and relevance features of the sibling URIs of u_t in RDF graph G (a sibling-extraction sketch follows);
Sibling URIs: URIs that share a common subject-property or property-object pair with u_t in G.
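One possible way to collect sibling URIs with rdflib; the helper name sibling_uris is mine and this is a sketch, not the paper's implementation:

```python
from rdflib import Graph, URIRef

def sibling_uris(g: Graph, u_t: URIRef):
    """Collect URIs that share a subject-property or property-object pair with u_t."""
    siblings = set()
    # u_t appears as an object: other objects of the same (subject, property) pair
    for s, p in g.subject_predicates(object=u_t):
        for o in g.objects(subject=s, predicate=p):
            if isinstance(o, URIRef) and o != u_t:
                siblings.add(o)
    # u_t appears as a subject: other subjects of the same (property, object) pair
    for p, o in g.predicate_objects(subject=u_t):
        for s in g.subjects(predicate=p, object=o):
            if isinstance(s, URIRef) and s != u_t:
                siblings.add(s)
    return siblings
```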
9. Feature hashing is a fast and space-efficient way of vectorizing features. It works by applying a hash function to the features and using their hash values directly as indices into a vector. No dictionary needs to be built in advance.
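A minimal sketch of the hashing trick, assuming the mmh3 package (MurmurHash3 bindings, as mentioned in the notes at the end) and a fixed vector size of 2^20:

```python
import mmh3  # MurmurHash3 bindings; assumed available

def hash_features(features, dim=2**20):
    """Map string features like 'feature_u_host=www.example.com' to
    indices of a sparse binary vector of fixed size dim."""
    indices = set()
    for f in features:
        indices.add(mmh3.hash(f) % dim)  # Python's % always yields a non-negative index
    return sorted(indices)
```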
10. Online learning is a family of machine learning methods for streaming-data settings, i.e., a model is trained over consecutive rounds.
Why an online learning model?
• The environment changes over time. The Linked Data crawler operates in a dynamic environment, so the predictor must be able to update itself during crawling.
• Consecutive observations are not i.i.d. In our scenario, training examples (URIs) become available in sequential order, so the i.i.d. assumption does not hold.
• We develop our online prediction model based on the FTRL-Proximal (Follow The Regularized Leader) learning algorithm, developed at Google for the massive-scale online learning problem of predicting ad click-through rates (CTR).
11. Given a sequence of URIs u_1, u_2, ..., u_t, ..., u_T, we predict their RDF relevance as follows.
At round t+1, the FTRL-Proximal algorithm uses the update formula

w_{t+1} = \arg\min_{w} \left( \mathbf{g}_{1:t} \cdot w + \tfrac{1}{2}\sum_{s=1}^{t} \sigma_s \lVert w - w_s \rVert_2^2 + \lambda_1 \lVert w \rVert_1 \right)

where the first term is the subgradient approximation of the accumulated loss, the second term is a stabilization penalty, and the third term is an L1 penalty. The predicted probability for u_t (a logistic function of w_t \cdot x_t) is compared against a decision value \tau to label u_t as RDF-relevant or not.
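A per-coordinate sketch of FTRL-Proximal with logistic loss, following the standard McMahan et al. formulation; the hyperparameter names and defaults (alpha, beta, l1, l2) are assumptions, not the values used in the paper:

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic loss over hashed binary features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0, dim=2**20):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim   # accumulated adjusted gradients
        self.n = [0.0] * dim   # accumulated squared gradients

    def _weight(self, i):
        z, n = self.z[i], self.n[i]
        if abs(z) <= self.l1:
            return 0.0         # the L1 penalty drives this weight to exactly zero (sparsity)
        sign = -1.0 if z < 0 else 1.0
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, x):
        """x: list of active (hashed) feature indices with implicit value 1."""
        wx = sum(self._weight(i) for i in x)
        wx = max(min(wx, 35.0), -35.0)            # clip to avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(-wx))

    def update(self, x, p, y, sample_weight=1.0):
        """y in {0, 1}; sample_weight carries importance weights such as 1/eps."""
        g = (p - y) * sample_weight               # gradient of logistic loss per active feature
        for i in x:
            w_i = self._weight(i)
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w_i
            self.n[i] += g * g
```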
12. FTRL-Proximal is very efficient and effective at producing sparse models.
A sparse model corresponds to a weight vector with few non-zero entries.
Subsampling.
To observe the true class label of a URI, we have to download it and check whether the representation contains RDF/RDFa data or not.
For the URIs predicted negative, we select only a fraction ε ∈ (0, 1] of them to download and observe their true class label.
Here ε balances online prediction accuracy against network overhead (a subsampling sketch follows).
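A minimal sketch of this ε-subsampling, together with the 1/ε importance weight used to correct the subsampling bias (see the notes at the end); the default value 0.3 for eps is only a placeholder:

```python
import random

def observe_label(uri, predicted_positive, eps=0.3):
    """Decide whether to download a URI to observe its true label (sketch).

    Returns (should_download, importance_weight): predicted positives are
    always downloaded; predicted negatives are downloaded with probability
    eps and, when kept, receive importance weight 1/eps.
    """
    if predicted_positive:
        return True, 1.0
    if random.random() < eps:
        return True, 1.0 / eps
    return False, 0.0
```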
13. The crawler starts from a list of seed URIs and operates a breadth-first crawl.
To avoid issuing too many consecutive HTTP requests to one server, the URIs in the frontier are partitioned into different sets based on their pay-level domains (PLDs). URIs are polled from the PLD sets in a round-robin fashion.
Once a URI u_t is polled, the predictor predicts its class label y_t. If y_t = 1, we retrieve the content of u_t; if y_t = 0, we retrieve the content of u_t with probability ε.
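Putting the pieces together, a sketch of the crawl loop; pld(), extract_features(), fetch() and extract_uris() are assumed helpers (not published code), observe_label() is the sketch above, and the 0.5 threshold stands in for the decision value τ:

```python
from collections import defaultdict, deque

def crawl(seeds, predictor, eps, max_urls=10000):
    """Breadth-first crawl with per-PLD round-robin polling (sketch)."""
    frontier = defaultdict(deque)                 # PLD -> queue of URIs
    for u in seeds:
        frontier[pld(u)].append(u)
    crawled = 0
    while frontier and crawled < max_urls:
        for domain in list(frontier.keys()):      # round-robin over PLD sets
            if not frontier[domain]:
                del frontier[domain]
                continue
            u = frontier[domain].popleft()
            x = extract_features(u)
            p = predictor.predict(x)
            download, weight = observe_label(u, p >= 0.5, eps)
            if not download:
                continue
            content, is_rdf = fetch(u)            # dereference and check for RDF/RDFa
            predictor.update(x, p, int(is_rdf), weight)
            crawled += 1
            if crawled >= max_urls:
                return
            if is_rdf:
                for new_u in extract_uris(content):
                    frontier[pld(new_u)].append(new_u)
```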
15. Aim: We evaluate the predictive ability of different combinations of feature sets.
Dataset. The dataset is generated by operating a breadth-first search (BFS) crawl. It contains 103K URIs from 9,825 different hosts.
Static classifiers. Static classifiers (SVM, KNN, and Naive Bayes) are used to examine the predictive performance of the different combinations of feature sets.
16. Observations:
• The feature combination F+t+r+p (without the sibling feature set) achieved the best performance.
• The features from the target URI u_t are important.
17. We evaluated the performance of the proposed crawler with the online prediction component against several baseline (offline) crawlers.
Methods. We implemented the proposed crawler, denoted LDCOC (Linked Data Crawler with Online Classifier). We also implemented three crawlers that use offline classifiers (SVM, KNN, and Naive Bayes) to select URIs.
Metrics. We use a percentage measure that equals the ratio of retrieved RDF-relevant URIs to the total number of URIs crawled:
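Stated as a formula (the name "coverage rate" is my label for this percentage measure, following the text above):

\text{coverage rate} = \frac{|\{\text{crawled URIs that are RDF-relevant}\}|}{|\{\text{crawled URIs}\}|} \times 100\%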
19. Conclusion
We have presented a solution to learn URI selection criteria in order to improve the crawling of Linked Open Data by predicting the RDF-relevance of URIs.
We built a prediction model based on the FTRL-Proximal learning algorithm and implemented a crawler for Linked Open Data.
Our method can be generalized to crawl other kinds of Linked Data formats such as JSON-LD, Microdata, etc.
Future work
Investigating more features, such as the subgraphs induced by URIs;
Employing a parallelized version of the FTRL-Proximal learning algorithm.
The Web is evolving from a "Web of linked documents" into a "Web of linked data".
Linked Data relies on URIs (Uniform Resource Identifiers) for naming things, RDF for describing data, and SPARQL for querying it.
A large amount of RDF triples is available as linked data in various domains such as health, publications, agriculture, music, etc., in well-known linked datasets such as DBpedia, Wikidata, and Freebase.
In order to harvest this enormous Linked Data Web, crawling techniques for Linked Data are becoming increasingly important.
HTTP headers are mainly intended for the communication between the server and client in both directions.
By hashing the features we gain significant advantages in memory usage, since we can bound the size of the generated binary vectors and do not need a pre-built dictionary. We use MurmurHash3 as the hash function.
In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update our best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once.
Subgradient approximation: the loss of each previous round, f_t(w), is approximated by its subgradient at w_t, i.e., all previous rounds are approximated linearly; using the accumulated subgradients g_{1:t} (rather than a single g_t) gives a more "conservative" estimate and also covers convex functions that are not necessarily differentiable.
Stabilization penalty: we do not want to change the hypothesis too much from the previous iterates (for fear of predicting badly on examples we have already seen); the per-round coefficient acts as a step size that makes each step more conservative.
L1 penalty: λ_1 is a weighting factor; this regularization term is used to induce sparsity.
In the experiments, we set ε as the ratio of the number of positive URIs to the number of negative URIs in the training set.
To deal with the bias of this subsampled data, we assign an importance weight of 1/ε to these training examples.
The experiments are divided into two parts. The first experiment evaluates the predictive ability of different combinations of feature sets: we report the results for the 4-element combination of feature sets (F+t+r+p+x), all 3-element combinations, and the 2-element combinations.
We also found that the performance of F+r+p+x, obtained by excluding feature set F+t from F+t+r+p+x, drops considerably.