SlideShare une entreprise Scribd logo
Authors: Hai Huang, Fabien Gandon
Wimmics joint research team (Inria, Univ. Côte d’Azur, CNRS, I3S, France)
1
 To harvest enormous Linked Data Web, crawling techniques for Linked Data are
important.
 Linked Data Crawler aims to crawl RDF/RDFa data as much as possible (maximizing
coverage) on Linked Open Data Web while minimizing unnecessary downloads.
 One of challenges to Linked Data crawlers is to determine the RDF relevance of a URI
without downloading it.
 RDF relevant: Given a URI u, we call u RDF-relevant if the representation obtained by
dereferencing u contains RDF data; non-RDF relevant, otherwise.
2
 Existing work [1,2] use some heuristic rules to identify RDF-Relevant URIs.
 File Extension such as:*.rdf, *.owl, etc;
 HTTP Content-Type (HTTP Header) such as: application/rdf+xml, text/turtle, etc.;
 The coverage of crawling would be hurt by using these heuristic rules
 It can happen that the server does not provide the desired content type.
 It is reported that 17% of RDF/XML documents are returned with a content-type other
than application/rdf+xml.
 There are a large number of “ hard” URIs whose RDF relevance can not be
determined by employing these heuristic rules.
[1]Hogan, A., Harth, A., Umbrich, J., Kinsella, S., Polleres, A., Decker, S.: Searching and browsing linked data with SWSE: the semantic web search
engine. J. Web Sem. 9(4), 365{401 (2011)
[2]Isele, R., Umbrich, J., Bizer, C., Harth, A.: Ldspider: An open-source crawling framework for the web of linked data. In: Proceedings of the ISWC
2010 Posters & Demonstrations Track (2010)
3
RDFa content:
<meta property="og:title"
content="The City in Late
Imperial China"/>
<meta property="og:type"
content="book"/>
<meta property="og:site_name"
content="Google Books"/>
4
 Prediction Task. Suppose that a hard URI ut is discovered from the
RDF graph G obtained by dereferencing a referring URI ur. Our task
is to build a learning model to predict whether ut is RDF-relevant
or not.
RDF relevant?URI ut
5
6
Given a target URI ut found in RDF data graph G contained in referring URI ur, the
features can be grouped into four sets:
 URI information and HTTP Header information of target URI ut (including host, path,
port etc. ), denoted by feature set F+t:
F+t={ feature_u_scheme=`https’,
feature_u_authority=`john.doe@www.example.com:123’,
feature_u_ host=`www.example.com’,
feature_u_ userinfo=`john.doe’,
feature_u_ path=`forum/questions’,
feature_u_ query=`?tag=networking&order=newest’,
feature_u_ fragment=`top’,
feature_u_Content-Type=‘html’}
7
• Feature set F+r: URI information of referring URI ur;
 Feature set F+p: URI information of direct properties which connect URI ut in
RDF graph G;
 Feature set F+x: URI information, similarity and relevance features of the sibling
URIs of ut in RDF graph G;
 Sibling URIs: URIs that have a common subject-property or property-object
pair with ut in G
8
 Feature hashing is a fast and space-efficient way of vectorizing features. It works by
applying a hash function to the features and using their hash values as indices
directly in a vector. Dictionary is not needed in advance.
9
• Online Learning is a kind of machine learning methods in a streaming data
setting, i.e., training a model in consecutive rounds.
• Why Online learning Model?
• The environment changes over time. The linked data crawler operates in a dynamic
environment, it requires the predictor is able to update itself during crawling.
• Consecutive observations are not i.i.d. In our scenario, training examples (URIs)
become available in a sequential order and the i.i.d assumption does not hold.
• We develop our online prediction model based on FTRL- proximal (Follow the
Regularized Leader) learning algorithm which is developed by Google for the
massive online learning problem of predicting ad click–through rates (CTR).
10
Given a sequence of URIs u1 , u2 , …, ur … uT , we predict the RDF relevance of them by:
At round t + 1, the FTRL-Proximal algorithm uses the update formula:
𝞃 is a decision value
stabilization penaltysubgradient
approximation
L1 penalty
11
 FTRL-proximal is very efficient and effective at producing sparse models.
 A sparse model corresponds to a weight vector with few non-zero entries.
 Subsampling
 To observe the true class label of a URI we have to download it and check whether the
representation contains RDF/RDFa data or not.
 For those URIs predicted negative, we only select a fraction ε ∈ (0, 1] of them to
download and observe their true class label.
 Here ε is a balance between online prediction precision and network overhead.
12
 The crawler starts from a list of seed URIs and operates an breadth-first crawling.
 To avoid issuing too many consecutive HTTP requests to a server, the URIs in
Frontier into different sets based on their Pay Level Domains(PLDs). URIs are
polled from PLD sets in a round-robin fashion,
 Once a URI ut is polled, the predictor predicts the class label yt of ut , If yt = 1, we
retrieve the content of ut; If yt = 0, we retrieve the content of ut with probability ε.
13
14
 Aim: We evaluate the predictive ability of different combinations of feature
sets.
 Dataset. The dataset is generated by operating a Breadth-First Search (BFS)
crawl. It contains103K URIs from 9,825 different hosts.
 Static classifiers. The static classifiers including SVM, KNN and Naive Bayes
are used to examine the predictive performance of different combinations of
feature sets.
15
Observations:
• The feature combination F+t+r+p (without the sibling feature set) got the best performance.
• The features from target URI ut are important.
16
 We evaluated the performance of the proposed crawler with online prediction
component against several baseline (offline) crawlers.
 Methods. We implemented the proposed crawler denoted by LDCOC (Linked Data
Crawler with Online Classifier). We also implemented three crawlers with offline
classifiers including SVM, KNN and Naive Bayes to select URIs.
 Metrics. We use a percentage measure that equals the ratio of retrieved RDF-relevant
URIs to the total number of URIs crawled:
17
after crawling 300K URIs.
18
 Conclusion
 We have presented a solution to learn URI selection criteria in order to improve the
crawling of Linked Open Data by predicting their RDF-relevance.
 We build a prediction model based on FTRL-proximal learning algorithm and implement
a crawler for Linked Open Data.
 Our method can be generalized to crawl other kinds of Linked Data formats such as
JSON-LD, Microdata, etc.
 Future work
 Investigating more features such as the subgraphs induced by URIs;
 Employing the parallelized version of the FTRL-proximal learning algorithm;
19
20

Contenu connexe

Tendances

SampLD, Structural Properties as Proxy for Semantic Relevance
SampLD, Structural Properties as Proxy for Semantic RelevanceSampLD, Structural Properties as Proxy for Semantic Relevance
SampLD, Structural Properties as Proxy for Semantic Relevancelaurensrietveld
 
Automatic Metadata Generation using Associative Networks
Automatic Metadata Generation using Associative NetworksAutomatic Metadata Generation using Associative Networks
Automatic Metadata Generation using Associative NetworksMarko Rodriguez
 
A metadata focused crawler for Linked Data
A metadata focused crawler for Linked DataA metadata focused crawler for Linked Data
A metadata focused crawler for Linked DataRaphael do Vale
 
Finding Similar Files in Large Document Repositories
Finding Similar Files in Large Document RepositoriesFinding Similar Files in Large Document Repositories
Finding Similar Files in Large Document Repositoriesfeiwin
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talkrtelmore
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...Dataconomy Media
 
The Network Data Structure in Computing
The Network Data Structure in ComputingThe Network Data Structure in Computing
The Network Data Structure in ComputingMarko Rodriguez
 
Unit i data structure FYCS MUMBAI UNIVERSITY SEM II
Unit i  data structure FYCS MUMBAI UNIVERSITY SEM II Unit i  data structure FYCS MUMBAI UNIVERSITY SEM II
Unit i data structure FYCS MUMBAI UNIVERSITY SEM II ajay pashankar
 
Rethinking Online SPARQL Querying to Support Incremental Result Visualization
Rethinking Online SPARQL Querying to Support Incremental Result VisualizationRethinking Online SPARQL Querying to Support Incremental Result Visualization
Rethinking Online SPARQL Querying to Support Incremental Result VisualizationOlaf Hartig
 
Information retrieval
Information retrievalInformation retrieval
Information retrievalHarry Potter
 
An Introduction To Python - Working With Data
An Introduction To Python - Working With DataAn Introduction To Python - Working With Data
An Introduction To Python - Working With DataBlue Elephant Consulting
 
A Closer Look at the Changing Dynamics of DBpedia Mappings
A Closer Look at the Changing Dynamics of DBpedia MappingsA Closer Look at the Changing Dynamics of DBpedia Mappings
A Closer Look at the Changing Dynamics of DBpedia MappingsMaribel Acosta Deibe
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahoutaneeshabakharia
 
Tutorial "Linked Data Query Processing" Part 5 "Query Planning and Optimizati...
Tutorial "Linked Data Query Processing" Part 5 "Query Planning and Optimizati...Tutorial "Linked Data Query Processing" Part 5 "Query Planning and Optimizati...
Tutorial "Linked Data Query Processing" Part 5 "Query Planning and Optimizati...Olaf Hartig
 

Tendances (16)

SampLD, Structural Properties as Proxy for Semantic Relevance
SampLD, Structural Properties as Proxy for Semantic RelevanceSampLD, Structural Properties as Proxy for Semantic Relevance
SampLD, Structural Properties as Proxy for Semantic Relevance
 
Automatic Metadata Generation using Associative Networks
Automatic Metadata Generation using Associative NetworksAutomatic Metadata Generation using Associative Networks
Automatic Metadata Generation using Associative Networks
 
Unit 3
Unit 3Unit 3
Unit 3
 
A metadata focused crawler for Linked Data
A metadata focused crawler for Linked DataA metadata focused crawler for Linked Data
A metadata focused crawler for Linked Data
 
Finding Similar Files in Large Document Repositories
Finding Similar Files in Large Document RepositoriesFinding Similar Files in Large Document Repositories
Finding Similar Files in Large Document Repositories
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talk
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
 
The Network Data Structure in Computing
The Network Data Structure in ComputingThe Network Data Structure in Computing
The Network Data Structure in Computing
 
Unit i data structure FYCS MUMBAI UNIVERSITY SEM II
Unit i  data structure FYCS MUMBAI UNIVERSITY SEM II Unit i  data structure FYCS MUMBAI UNIVERSITY SEM II
Unit i data structure FYCS MUMBAI UNIVERSITY SEM II
 
Rethinking Online SPARQL Querying to Support Incremental Result Visualization
Rethinking Online SPARQL Querying to Support Incremental Result VisualizationRethinking Online SPARQL Querying to Support Incremental Result Visualization
Rethinking Online SPARQL Querying to Support Incremental Result Visualization
 
Information retrieval
Information retrievalInformation retrieval
Information retrieval
 
An Introduction To Python - Working With Data
An Introduction To Python - Working With DataAn Introduction To Python - Working With Data
An Introduction To Python - Working With Data
 
A Closer Look at the Changing Dynamics of DBpedia Mappings
A Closer Look at the Changing Dynamics of DBpedia MappingsA Closer Look at the Changing Dynamics of DBpedia Mappings
A Closer Look at the Changing Dynamics of DBpedia Mappings
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
Tutorial "Linked Data Query Processing" Part 5 "Query Planning and Optimizati...
Tutorial "Linked Data Query Processing" Part 5 "Query Planning and Optimizati...Tutorial "Linked Data Query Processing" Part 5 "Query Planning and Optimizati...
Tutorial "Linked Data Query Processing" Part 5 "Query Planning and Optimizati...
 

Similaire à Hai huang presentation

OpenCalais in Linked Data context
OpenCalais in Linked Data contextOpenCalais in Linked Data context
OpenCalais in Linked Data contexteldorina
 
RDF-Gen: Generating RDF from streaming and archival data
RDF-Gen: Generating RDF from streaming and archival dataRDF-Gen: Generating RDF from streaming and archival data
RDF-Gen: Generating RDF from streaming and archival dataGiorgos Santipantakis
 
Linked data HHS 2015
Linked data HHS 2015Linked data HHS 2015
Linked data HHS 2015Cason Snow
 
Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ...
 Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ... Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ...
Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ...Vladimir Alexiev, PhD, PMP
 
Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011Juan Sequeda
 
Sparql semantic information retrieval by
Sparql semantic information retrieval bySparql semantic information retrieval by
Sparql semantic information retrieval byIJNSA Journal
 
Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678Editor IJARCET
 
Deploying PHP applications using Virtuoso as Application Server
Deploying PHP applications using Virtuoso as Application ServerDeploying PHP applications using Virtuoso as Application Server
Deploying PHP applications using Virtuoso as Application Serverwebhostingguy
 
Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormJulien Nioche
 
Working Of Search Engine
Working Of Search EngineWorking Of Search Engine
Working Of Search EngineNIKHIL NAIR
 
Search Engines After The Semanatic Web
Search Engines After The Semanatic WebSearch Engines After The Semanatic Web
Search Engines After The Semanatic Websamar_slideshare
 
Comparative study on the processing of RDF in PHP
Comparative study on the processing of RDF in PHPComparative study on the processing of RDF in PHP
Comparative study on the processing of RDF in PHPMSGUNC
 
How a search engine works slide
How a search engine works slideHow a search engine works slide
How a search engine works slideSovan Misra
 
‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...
‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...
‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...CONUL Conference
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked dataLaura Po
 
Llinked open data training for EU institutions
Llinked open data training for EU institutionsLlinked open data training for EU institutions
Llinked open data training for EU institutionsOpen Data Support
 
Making data FAIR using InterMine
Making data FAIR using InterMineMaking data FAIR using InterMine
Making data FAIR using InterMineJustin Clark-Casey
 

Similaire à Hai huang presentation (20)

OpenCalais in Linked Data context
OpenCalais in Linked Data contextOpenCalais in Linked Data context
OpenCalais in Linked Data context
 
RDF-Gen: Generating RDF from streaming and archival data
RDF-Gen: Generating RDF from streaming and archival dataRDF-Gen: Generating RDF from streaming and archival data
RDF-Gen: Generating RDF from streaming and archival data
 
Linked data HHS 2015
Linked data HHS 2015Linked data HHS 2015
Linked data HHS 2015
 
The CIARD RINGValeri
The CIARD RINGValeriThe CIARD RINGValeri
The CIARD RINGValeri
 
Linked sensor data
Linked sensor dataLinked sensor data
Linked sensor data
 
Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ...
 Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ... Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ...
Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ...
 
Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011Consuming Linked Data 4/5 Semtech2011
Consuming Linked Data 4/5 Semtech2011
 
Sparql semantic information retrieval by
Sparql semantic information retrieval bySparql semantic information retrieval by
Sparql semantic information retrieval by
 
Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678
 
Deploying PHP applications using Virtuoso as Application Server
Deploying PHP applications using Virtuoso as Application ServerDeploying PHP applications using Virtuoso as Application Server
Deploying PHP applications using Virtuoso as Application Server
 
Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache Storm
 
Working Of Search Engine
Working Of Search EngineWorking Of Search Engine
Working Of Search Engine
 
Search Engines After The Semanatic Web
Search Engines After The Semanatic WebSearch Engines After The Semanatic Web
Search Engines After The Semanatic Web
 
Comparative study on the processing of RDF in PHP
Comparative study on the processing of RDF in PHPComparative study on the processing of RDF in PHP
Comparative study on the processing of RDF in PHP
 
How a search engine works slide
How a search engine works slideHow a search engine works slide
How a search engine works slide
 
‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...
‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...
‘Facilitating User Engagement by Enriching Library Data using Semantic Techno...
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked data
 
Llinked open data training for EU institutions
Llinked open data training for EU institutionsLlinked open data training for EU institutions
Llinked open data training for EU institutions
 
Making data FAIR using InterMine
Making data FAIR using InterMineMaking data FAIR using InterMine
Making data FAIR using InterMine
 
Timbuctoo 2 EASY
Timbuctoo 2 EASYTimbuctoo 2 EASY
Timbuctoo 2 EASY
 

Dernier

How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?DOT TECH
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictJack Cole
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfMichaelSenkow
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesStarCompliance.io
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxStephen266013
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdfvyankatesh1
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .NABLAS株式会社
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Calllward7
 
how can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like Bitcoinhow can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like BitcoinDOT TECH
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...correoyaya
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJames Polillo
 
Machine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptxMachine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptxbenishzehra469
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsalex933524
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group MeetingAlison Pitt
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsCEPTES Software Inc
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...elinavihriala
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxDilipVasan
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIAlejandraGmez176757
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBAlireza Kamrani
 

Dernier (20)

How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdf
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
how can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like Bitcoinhow can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like Bitcoin
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
Machine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptxMachine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptx
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDB
 

Hai huang presentation

  • 1. Authors: Hai Huang, Fabien Gandon Wimmics joint research team (Inria, Univ. Côte d’Azur, CNRS, I3S, France) 1
  • 2.  To harvest enormous Linked Data Web, crawling techniques for Linked Data are important.  Linked Data Crawler aims to crawl RDF/RDFa data as much as possible (maximizing coverage) on Linked Open Data Web while minimizing unnecessary downloads.  One of challenges to Linked Data crawlers is to determine the RDF relevance of a URI without downloading it.  RDF relevant: Given a URI u, we call u RDF-relevant if the representation obtained by dereferencing u contains RDF data; non-RDF relevant, otherwise. 2
  • 3.  Existing work [1,2] use some heuristic rules to identify RDF-Relevant URIs.  File Extension such as:*.rdf, *.owl, etc;  HTTP Content-Type (HTTP Header) such as: application/rdf+xml, text/turtle, etc.;  The coverage of crawling would be hurt by using these heuristic rules  It can happen that the server does not provide the desired content type.  It is reported that 17% of RDF/XML documents are returned with a content-type other than application/rdf+xml.  There are a large number of “ hard” URIs whose RDF relevance can not be determined by employing these heuristic rules. [1]Hogan, A., Harth, A., Umbrich, J., Kinsella, S., Polleres, A., Decker, S.: Searching and browsing linked data with SWSE: the semantic web search engine. J. Web Sem. 9(4), 365{401 (2011) [2]Isele, R., Umbrich, J., Bizer, C., Harth, A.: Ldspider: An open-source crawling framework for the web of linked data. In: Proceedings of the ISWC 2010 Posters & Demonstrations Track (2010) 3
  • 4. RDFa content: <meta property="og:title" content="The City in Late Imperial China"/> <meta property="og:type" content="book"/> <meta property="og:site_name" content="Google Books"/> 4
  • 5.  Prediction Task. Suppose that a hard URI ut is discovered from the RDF graph G obtained by dereferencing a referring URI ur. Our task is to build a learning model to predict whether ut is RDF-relevant or not. RDF relevant?URI ut 5
  • 6. 6
  • 7. Given a target URI ut found in RDF data graph G contained in referring URI ur, the features can be grouped into four sets:  URI information and HTTP Header information of target URI ut (including host, path, port etc. ), denoted by feature set F+t: F+t={ feature_u_scheme=`https’, feature_u_authority=`john.doe@www.example.com:123’, feature_u_ host=`www.example.com’, feature_u_ userinfo=`john.doe’, feature_u_ path=`forum/questions’, feature_u_ query=`?tag=networking&order=newest’, feature_u_ fragment=`top’, feature_u_Content-Type=‘html’} 7
  • 8. • Feature set F+r: URI information of referring URI ur;  Feature set F+p: URI information of direct properties which connect URI ut in RDF graph G;  Feature set F+x: URI information, similarity and relevance features of the sibling URIs of ut in RDF graph G;  Sibling URIs: URIs that have a common subject-property or property-object pair with ut in G 8
  • 9.  Feature hashing is a fast and space-efficient way of vectorizing features. It works by applying a hash function to the features and using their hash values as indices directly in a vector. Dictionary is not needed in advance. 9
  • 10. • Online Learning is a kind of machine learning methods in a streaming data setting, i.e., training a model in consecutive rounds. • Why Online learning Model? • The environment changes over time. The linked data crawler operates in a dynamic environment, it requires the predictor is able to update itself during crawling. • Consecutive observations are not i.i.d. In our scenario, training examples (URIs) become available in a sequential order and the i.i.d assumption does not hold. • We develop our online prediction model based on FTRL- proximal (Follow the Regularized Leader) learning algorithm which is developed by Google for the massive online learning problem of predicting ad click–through rates (CTR). 10
  • 11. Given a sequence of URIs u1 , u2 , …, ur … uT , we predict the RDF relevance of them by: At round t + 1, the FTRL-Proximal algorithm uses the update formula: 𝞃 is a decision value stabilization penaltysubgradient approximation L1 penalty 11
  • 12.  FTRL-proximal is very efficient and effective at producing sparse models.  A sparse model corresponds to a weight vector with few non-zero entries.  Subsampling  To observe the true class label of a URI we have to download it and check whether the representation contains RDF/RDFa data or not.  For those URIs predicted negative, we only select a fraction ε ∈ (0, 1] of them to download and observe their true class label.  Here ε is a balance between online prediction precision and network overhead. 12
  • 13.  The crawler starts from a list of seed URIs and operates an breadth-first crawling.  To avoid issuing too many consecutive HTTP requests to a server, the URIs in Frontier into different sets based on their Pay Level Domains(PLDs). URIs are polled from PLD sets in a round-robin fashion,  Once a URI ut is polled, the predictor predicts the class label yt of ut , If yt = 1, we retrieve the content of ut; If yt = 0, we retrieve the content of ut with probability ε. 13
  • 14. 14
  • 15.  Aim: We evaluate the predictive ability of different combinations of feature sets.  Dataset. The dataset is generated by operating a Breadth-First Search (BFS) crawl. It contains103K URIs from 9,825 different hosts.  Static classifiers. The static classifiers including SVM, KNN and Naive Bayes are used to examine the predictive performance of different combinations of feature sets. 15
  • 16. Observations: • The feature combination F+t+r+p (without the sibling feature set) got the best performance. • The features from target URI ut are important. 16
  • 17.  We evaluated the performance of the proposed crawler with online prediction component against several baseline (offline) crawlers.  Methods. We implemented the proposed crawler denoted by LDCOC (Linked Data Crawler with Online Classifier). We also implemented three crawlers with offline classifiers including SVM, KNN and Naive Bayes to select URIs.  Metrics. We use a percentage measure that equals the ratio of retrieved RDF-relevant URIs to the total number of URIs crawled: 17
  • 19.  Conclusion  We have presented a solution to learn URI selection criteria in order to improve the crawling of Linked Open Data by predicting their RDF-relevance.  We build a prediction model based on FTRL-proximal learning algorithm and implement a crawler for Linked Open Data.  Our method can be generalized to crawl other kinds of Linked Data formats such as JSON-LD, Microdata, etc.  Future work  Investigating more features such as the subgraphs induced by URIs;  Employing the parallelized version of the FTRL-proximal learning algorithm; 19
  • 20. 20

Notes de l'éditeur

  1.  In order to harvest this enormous Linked Data Web, crawling techniques for Linked Data are becoming increasingly important. The Web is evolving from a “Web of linked documents” into a “Web of linked data”. Linked Data URIs (Uniform Resource Identifier) for naming things, RDF for describing data and SPARQL for querying it. A large amount of RDF triples are available as linked data in various domains such as health, publication, agriculture, music, etc. Well known linked datasets such as DBpedia, WikiData, Freebase
  2. HTTP headers are mainly intended for the communication between the server and client in both directions.
  3. By hashing the features we can gain signicant advantages in memory usage since we can bound the size of generated binary vectors and we do not need a pre-built dictionary. MurmurHash3
  4. In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update our best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once.
  5. subgradient approximation,   where we approximate the prediction error, ft (! ),  by its subgradient at !t , all previous rounds are approximated using subgradient (linear) approximation. more “conservative” for the value of 𝐠𝑡 by using the average of 𝐠1:𝑡 Some convex functions which are not necessarily differentiable. stabilization penalty, However, we don't want to change our hypothesis 𝑥𝑡 too much (for fear of predicting badly on examples we have already seen). 𝜇𝑡 is a step size for this sample, and it should make each step more conservative. 𝝀 is a weighting factor, , this term is a regularization term used to induce sparsity
  6. In experiments, we set ε as the ratio of the number of positive URIs to the number of negative URIs in the training set. To deal with the bias of this subsampled data, we assign an importance weight 1/ ε to these training examples.
  7. Experiments are divided into two parts. The first experiment is about evaluating the predicative ability of different …
  8. we report the results for the 4-element combination of feature sets ( F+t+r+p+x), all 3-element combinations and the 2-element combinations We also found that the performance of F+r+p+x  which is derived by excluding feature set F+t  from F+t+r+p+x  decreases a lot