Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures Martin Klein, Jeb Ware, Michael L. Nelson {mklein,jware,mln}@cs.odu.edu JCDL 2011 Ottawa, Canada 07/14/2011
The Problem
Previously on… Missing Web Pages
There was a (IKEA) web page
Use Memento to get an archived copy
Generate lexical signature [Klein2010b]
Query search engines
What if there is no archived copy?
Link neighborhood lexical signatures
Link Neighborhood
[Figure: backlink pages #1 A ("is about IKEA"), #2 B and #3 C; terms shown: Bjoern, Oslo, Dorm room, Nobel, Herring; arrow labeled "extract"]
Lexical Signatures (LSs)
First introduced by Phelps and Wilensky [Phelps2000]
Small set of terms capturing "aboutness" of a document, "lightweight" metadata
[Figure labels: Resource, Abstract, 10,000 terms, 200 terms]
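As a rough illustration of how a lexical signature might be computed, the sketch below ranks a document's terms by TF-IDF and keeps the top k. The tokenizer, the stop-word list, the document-frequency table `doc_freq`, and the corpus size `n_docs` are all assumptions made up for this example; they are not taken from the talk.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}

def lexical_signature(text, doc_freq, n_docs, k=5):
    """Return the k terms of `text` with the highest TF-IDF score."""
    terms = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    tf = Counter(terms)

    def score(term):
        # Terms missing from the table are treated as appearing in one document.
        df = doc_freq.get(term, 1)
        return tf[term] * math.log(n_docs / df)

    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(tf, key=lambda t: (-score(t), t))[:k]

# Made-up document frequencies for a corpus of 1,000 documents.
doc_freq = {"ikea": 5, "furniture": 40, "store": 200, "page": 900}
text = "IKEA furniture store: the IKEA page for furniture in the store"
signature = lexical_signature(text, doc_freq, n_docs=1000, k=3)
```

Rare but frequent-in-document terms such as "ikea" end up at the front of the signature, while common terms such as "page" drop out.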
Research Questions
What is a good length of link neighborhood lexical signatures?
  5..8 for tags [Klein2011]
How many backlinks to include?
The more backlink levels the better?
What radius on the backlink page to use?
The Radius on a Backlink Page
Entire page
Paragraph
Anchor text
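One way to picture the radius is as a window of words around the link's anchor text on the backlink page. The sketch below assumes the page has already been tokenized into a word list and that the anchor's position is known; the function name `radius_terms` and the sample sentence are hypothetical. The paragraph radius is not shown because it would need the page's markup.

```python
def radius_terms(words, anchor_start, anchor_end, extra=0):
    """Return the anchor words plus `extra` words on each side.

    `words` is the tokenized backlink page; the anchor text occupies
    words[anchor_start:anchor_end].
    """
    lo = max(0, anchor_start - extra)
    hi = min(len(words), anchor_end + extra)
    return words[lo:hi]

page = "our dorm room was furnished with cheap IKEA shelves from the store in Oslo".split()

# Assume the link's anchor text is "IKEA shelves" (words 7 and 8).
anchor_only = radius_terms(page, 7, 9)
anchor_plus_2 = radius_terms(page, 7, 9, extra=2)
entire_page = radius_terms(page, 7, 9, extra=len(page))
```

A larger `extra` pulls in more context around the link, and a value at least as large as the page length degenerates to the entire-page radius.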
The Dataset
309 URIs [Klein2010b]
28,325 first level, 306,700 second level backlinks
Filter for language, file type, etc.: 12% discarded
IDF values from Yahoo! [Klein2010b]
1..7 and 10 terms
Query Yahoo! API
Compute "goodness" (nDCG)
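The "goodness" score can be sketched as follows, assuming binary relevance: only the missing page's own URI counts as relevant (gain 1), every other result has gain 0. The slides do not spell out the relevance scale, so that choice, the function name, and the example URIs are assumptions.

```python
import math

def ndcg_single_target(result_uris, target_uri):
    """nDCG of a result list when only `target_uri` counts as relevant."""
    for rank, uri in enumerate(result_uris, start=1):
        if uri == target_uri:
            # DCG = 1 / log2(1 + rank); the ideal DCG (target at rank 1) is 1.
            return 1.0 / math.log2(1 + rank)
    # Target not returned at all.
    return 0.0

results = ["http://example.org/a", "http://example.org/b", "http://example.org/target"]
score_found = ndcg_single_target(results, "http://example.org/target")
score_missing = ndcg_single_target(results, "http://example.org/other")
```

Under this assumption the score is 1 when the page is returned first and decays logarithmically with its rank, so a lexical signature is rewarded for placing the target near the top of the result list.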
The Results
[Figure: 1st and 2nd level; series labeled level-radius-rank; annotation "better"]
The Results – Radius
[Figure: all radii; series labeled level-radius-rank]
The Results – Backlink Rank
[Figure: ranks 10, 100, 1000; series labeled level-radius-rank]
The Results – In Numbers
GOOD: 1-anchor-1000
WINNER: 1-anchor-10
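The level-radius-rank labels can be unpacked mechanically. The sketch below is only an illustration: the `Config` type and `parse_label` function are hypothetical, and reading the last field as the number of top-ranked backlink pages used is an interpretation of the slides rather than something they state outright.

```python
from typing import NamedTuple

class Config(NamedTuple):
    level: int   # backlink level (1 = first level, 2 = second level)
    radius: str  # e.g. "anchor"
    rank: int    # assumed: number of top-ranked backlink pages used

def parse_label(label):
    """Split a 'level-radius-rank' label into its three fields."""
    level, rest = label.split("-", 1)
    radius, rank = rest.rsplit("-", 1)
    return Config(int(level), radius, int(rank))

winner = parse_label("1-anchor-10")
good = parse_label("1-anchor-1000")
```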
Concluding Remarks
Synchronicity
Firefox add-on
Triggers on 404 error
Rediscover page via:
  Title
  Lexical signature
  Tags
  Link neighborhood lexical signature
  URI modification
http://bit.ly/no-more-404
Example: conference home page
In Conclusion…
Conclusions and Future Work
Optimal link neighborhood lexical signatures:
  Parsed from top 10 backlink pages

References
[Jones73] K. Spärck Jones, "Index Term Weighting", Information Storage and Retrieval, pp. 619-633, 1973
[Klein2008] M. Klein, M. L. Nelson, "Revisiting Lexical Signatures to (Re-)Discover Web Pages", ECDL 2008, pp. 371-382
[Klein2010a] M. Klein, J. Shipman, M. L. Nelson, "Is This a Good Title", Hypertext 2010, pp. 3-12
[Klein2010b] M. Klein, M. L. Nelson, "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure", JCDL 2010, pp. 59-68
[Klein2011] M. Klein, M. L. Nelson, "Find, New, Copy, Web, Page – Tagging for the (Re-)Discovery of Web Pages", TPDL 2011, to appear
[Phelps2000] T. A. Phelps, R. Wilensky, "Robust Hyperlinks Cost Just Five Words Each", technical report, University of California at Berkeley, 2000
The Results – Backlink Level
[Figure: anchor text ± 10 words; series labeled level-radius-rank]