Temporal Anchor Text as Proxy for User Queries
Thaer Samar, Arjen P. de Vries
Web Archiving 1/2
• The Web is a major source of published information
• Content on the Web evolves and changes continuously
• Many initiatives aim to archive the Web
  - Petabytes of archived data
Web Archiving 2/2
• Web archives are incomplete
• It is impossible to include all Web pages due to crawling limitations, e.g., [Masanès06]
  - Depth-first crawl: focuses only on selected web sites
  - Breadth-first crawl: focuses on the entire domain, but not in depth
Reconstruct Queries
• Our study: the evolution of anchor text over time, used to reconstruct what was important in the past
  - Information that would be similar to user queries
• Inspiration:
  - Document titles can be used as an approximation of user queries [Jin et al.]
  - Anchor text exhibits characteristics similar to user queries and document titles [Eiron & McCurley]
Queries in the Past
• User queries have usually not been preserved
  - Impossible to reconstruct which queries users would have used to search the archive
• However, web archives contain more than the Web page content
  - E.g., page source, different timestamps (archive date, last-modified date), link structure
Link Evidence and Anchor Text
• Link information comprises the source URL, the destination URL, and the anchor text
• Anchor text is a short text describing the destination page
  - Has been shown to improve search effectiveness in a large number of Information Retrieval studies
Example link
  Source: http://www.cwi.nl
  Destination: http://www.nwo.nl
  Anchor text: ‘NWO’
Data: Dutch Web Archive
• National Library of the Netherlands (KB)
• Depth-first (selective) Web archive
  - Since 2007
  - 10+ TB
  - 8,000+ websites
• Our snapshot
  - 2009-2012
Link Processing
Filtering
• text/html pages
  - ~70% of archived objects
Web Archive Record
  URL: http://www.cwi.nl
  Archive-Date: 20091201
  Content-Type: text/html
  <html>
  <a href=http://www.nwo.nl> NWO </a>
  </html>
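A minimal sketch of how a record like the one above could be filtered by content type and turned into (source URL, destination URL, anchor text, YYYYMM) tuples. The dictionary layout of `record` and the names `AnchorExtractor` and `extract_links` are assumptions made for illustration; the slides do not say which tooling was actually used.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(record):
    """Return (source, destination, anchor, YYYYMM) tuples for a text/html record.

    `record` is a hypothetical dict with the keys 'url', 'archive_date',
    'content_type' and 'body'.
    """
    if not record["content_type"].startswith("text/html"):
        return []  # filtering step: keep only text/html pages
    parser = AnchorExtractor()
    parser.feed(record["body"])
    parser.close()
    month = record["archive_date"][:6]  # YYYYMMDD -> YYYYMM
    return [(record["url"], href, text, month) for href, text in parser.links]


record = {
    "url": "http://www.cwi.nl",
    "archive_date": "20091201",
    "content_type": "text/html",
    "body": "<html>\n<a href=http://www.nwo.nl> NWO </a>\n</html>",
}
links = extract_links(record)
```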
Link Processing
Filtering
• Pages of type text/html
  - ~70% of archived objects
Extraction
• Source URL
• Destination URL
• Anchor text
• Crawl-date (YYYYMM)
Cleaning
• URL normalization; get the host of the source and the destination
• Clean spam, e.g., rolex watches
Partitioning
• Based on one-year and one-month granularity
Deduplication
• Remove duplicate links caused by crawling frequency
  - Same source, destination, and anchor text
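The cleaning, partitioning, and deduplication steps could be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the spam list `SPAM_TERMS`, the helper names, and the choice to normalize hosts by lowercasing only are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

SPAM_TERMS = {"rolex"}  # hypothetical spam vocabulary, for illustration only


def host_of(url):
    """Return the lowercased host of a URL (empty string if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def is_spam(anchor):
    """Flag anchor text containing any of the (assumed) spam terms."""
    lowered = anchor.lower()
    return any(term in lowered for term in SPAM_TERMS)


def partition_links(links, granularity="year"):
    """Group links by year or month and drop duplicates inside each partition.

    `links` holds (source, destination, anchor, YYYYMM) tuples. Using a set per
    partition removes links with the same source, destination, and anchor text.
    """
    key_len = 4 if granularity == "year" else 6
    partitions = defaultdict(set)
    for source, destination, anchor, yyyymm in links:
        if is_spam(anchor):
            continue
        partitions[yyyymm[:key_len]].add((source, destination, anchor))
    return dict(partitions)


links = [
    ("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "200912"),
    ("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "200912"),  # re-crawled duplicate
    ("http://example.nl", "http://shop.example", "cheap rolex watches", "200912"),
    ("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "201003"),
]
by_year = partition_links(links, "year")
by_month = partition_links(links, "month")
```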
Hosts Evolution
• Important hosts over time
• Aggregate links based on the target host
  - Keep unique source hosts
  - Multiple pages from the same host linking to the same target host are counted as one
• Rank hosts based on the number of source hosts linking to them
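The host ranking described above can be sketched as counting distinct source hosts per target host. The function name `rank_hosts` and the tie-breaking by host name are assumptions; the slides do not say whether links from a host to itself were excluded, and this sketch keeps them.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def rank_hosts(links):
    """Rank target hosts by the number of distinct source hosts linking to them.

    `links` holds (source URL, destination URL, anchor text) tuples from one
    time partition. Several pages of one host linking to the same target host
    count once, because source hosts are kept in a set.
    """
    sources_per_target = defaultdict(set)
    for source_url, destination_url, _anchor in links:
        source_host = (urlsplit(source_url).hostname or "").lower()
        target_host = (urlsplit(destination_url).hostname or "").lower()
        sources_per_target[target_host].add(source_host)
    counts = [(target, len(sources)) for target, sources in sources_per_target.items()]
    # Highest count first; ties broken alphabetically (an arbitrary choice).
    return sorted(counts, key=lambda item: (-item[1], item[0]))


links = [
    ("http://www.cwi.nl/a", "http://www.nwo.nl/", "NWO"),
    ("http://www.cwi.nl/b", "http://www.nwo.nl/x", "NWO"),  # same source host, counted once
    ("http://www.kb.nl/", "http://www.nwo.nl/", "nwo"),
    ("http://www.kb.nl/", "http://www.cwi.nl/", "CWI"),
]
ranking = rank_hosts(links)
```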
% of new hosts over the years
[Figure: % of new hosts per year, e.g., hosts in 2012 that are not in {2009, 2010, 2011}]
Anchor Text Evolution
• Measure the importance of anchor text a over time in time-partitioned links
  - Aggregate by anchor text
  - Compute the archive-based popularity
  - Normalize by the maximum
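A possible reading of the normalization step, as a sketch. The slides do not define the archive-based popularity measure; here it is assumed to be the number of links in a partition that carry a given anchor text, divided by the largest such count in that partition. The function name is hypothetical.

```python
from collections import Counter


def normalized_popularity(anchors):
    """Map each anchor text to its count divided by the maximum count.

    `anchors` is the list of anchor texts of all links in one time partition.
    The raw count is an assumed stand-in for the archive-based popularity.
    """
    counts = Counter(anchors)
    if not counts:
        return {}
    maximum = max(counts.values())
    return {anchor: count / maximum for anchor, count in counts.items()}


scores = normalized_popularity(["nwo", "nwo", "cwi", "nwo", "cwi", "kb"])
```

Under this assumption the most frequent anchor text in a partition always scores 1.0, which makes scores comparable across partitions of different sizes.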
% new anchor text over years
• Anchor text is new in a specific partition if it does not appear in the previous partitions
• Based on one-year granularity
  - 59% new anchor text
• Based on one-month granularity
  - 34% new anchor text
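The definition of "new" above can be sketched as a running set difference over chronologically ordered partitions. The function name and the toy data are illustrative assumptions; the same code applies to hosts or anchor texts, and to YYYY or YYYYMM keys.

```python
def percent_new(partitions):
    """Return, per period, the percentage of items not seen in any earlier period.

    `partitions` maps a period key (YYYY or YYYYMM string) to a set of items.
    The first period is trivially 100% new.
    """
    seen = set()
    result = {}
    for period in sorted(partitions):  # string sort is chronological for these keys
        items = partitions[period]
        new_items = items - seen
        result[period] = 100.0 * len(new_items) / len(items) if items else 0.0
        seen |= items
    return result


toy = {
    "2009": {"nwo", "cwi"},
    "2010": {"cwi", "kb"},
    "2011": {"nwo", "kb", "twitter", "flickr"},
}
new_per_year = percent_new(toy)
```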
WikiStats
• Aggregated view counts of Wikipedia (WP) pages
  - From Jan 2008 to Jan 2015
• We focus on
  - Feb 2009 to Dec 2012
  - Similar to the period of our snapshot of the Dutch Web archive
• Keep WP titles viewed >= 1,000 times
Matching anchor text to WP titles
• Pre-process WP titles like the anchor text
  - Lowercase
  - Stop-word removal
• One-year and one-month granularity partitions
• Collect titles by exact match with the anchors
• Assume anchor popularity equals WP page popularity
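The matching procedure can be sketched as normalizing both sides the same way and then looking up exact matches. The stop-word list, the function names, and the toy view counts are assumptions for illustration; only the lowercasing, stop-word removal, exact match, and the >= 1,000 views threshold come from the slides.

```python
STOP_WORDS = {"de", "het", "een", "van"}  # hypothetical, tiny Dutch stop-word list


def normalize(text):
    """Lowercase and drop stop words, as applied to both anchors and WP titles."""
    return " ".join(t for t in text.lower().split() if t not in STOP_WORDS)


def match_titles(anchors, title_views, min_views=1000):
    """Return {anchor: views} for anchors whose normalized form equals a WP title.

    `title_views` maps WP titles to view counts within one time partition.
    If several titles share a normalized form, the highest count is kept.
    """
    views_by_key = {}
    for title, views in title_views.items():
        if views >= min_views:
            key = normalize(title)
            views_by_key[key] = max(views, views_by_key.get(key, 0))
    matches = {}
    for anchor in anchors:
        key = normalize(anchor)
        if key and key in views_by_key:
            matches[anchor] = views_by_key[key]
    return matches


title_views = {"Amsterdam": 50000, "De Volkskrant": 3000, "filmpje!": 2000, "Obscure page": 10}
matches = match_titles(["Amsterdam", "de Volkskrant", "filmpje"], title_views)
```

In this toy run `filmpje` stays unmatched against the title `filmpje!`, because exact matching is sensitive to the trailing punctuation.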
Ranked anchor text with WP match
• Different rank cut-offs
• % overlap decreases as the cut-off increases
• ~56% of the top-1k anchor texts have a match
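The overlap at a rank cut-off can be sketched as the share of the top-k ranked anchor texts that have a WP match. The function name and the toy ranking are assumptions and do not reproduce the reported figures.

```python
def overlap_at_cutoffs(ranked_anchors, matched, cutoffs):
    """Return {k: % of the top-k ranked anchors that appear in `matched`}.

    `ranked_anchors` is ordered from most to least popular; `matched` is the
    set of anchors that have an exact WP title match.
    """
    result = {}
    for k in cutoffs:
        top = ranked_anchors[:k]
        hits = sum(1 for anchor in top if anchor in matched)
        result[k] = 100.0 * hits / len(top) if top else 0.0
    return result


ranked = ["amsterdam", "twitter", "nwo", "klik hier", "home"]
matched = {"amsterdam", "twitter", "nwo"}
overlap = overlap_at_cutoffs(ranked, matched, [2, 4, 5])
```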
Examples of popular anchor text (with match)
• Major cities in the Netherlands
  - E.g., Amsterdam, Rotterdam, Groningen, and Utrecht
• Social web sites
  - E.g., twitter, linkedin, flickr, and vimeo
• Major Dutch daily newspapers
  - E.g., de Volkskrant, Telegraaf, and Trouw
• Dutch public broadcasting
  - uitzending gemist
• Government web service
  - E.g., belastingdienst
Discussion
• Our original goal was to identify historically trending events from the link evolution recorded in the archive
• Unfortunately, we found only a few examples with our current analysis
  - E.g., ‘canon’ *
• However, important anchor text provides an overview of important Dutch entities
* Corresponding to an activity initiated by the government to define the canonical historic events in Dutch history
Limitations & Future Work
• Exact text matching between anchor text and WP title
  - E.g., filmpje does not match WP title filmpje!
• Additional pre-processing
  - Stemming, stopping, generalizing from exact match to match with low edit distance
• Our analysis is based on a depth-first crawl of a few thousand Dutch websites
  - Breadth-first crawl such as [CommonCrawl]
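The proposed generalization from exact match to low edit distance could look like the sketch below. It uses the standard Levenshtein distance; the threshold `max_distance=1` and the function names are assumptions, since the slides only name the idea as future work.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def fuzzy_match(anchor, titles, max_distance=1):
    """Return the titles within `max_distance` edits of the anchor (case-insensitive)."""
    return [t for t in titles if edit_distance(anchor.lower(), t.lower()) <= max_distance]


candidates = fuzzy_match("filmpje", ["filmpje!", "film"])
```

With a threshold of one edit, the anchor `filmpje` now matches the title `filmpje!`. A low threshold matters: short anchor texts would otherwise match many unrelated titles.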
References
• [Masanès06] J. Masanès. Web Archiving. Springer, 2006.
• [Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai. Title language model for information retrieval. In SIGIR 2002.
• [Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of anchor text for web search. In SIGIR 2003.
• [CommonCrawl] https://commoncrawl.org/
• [WikiStats] http://wikistats.ins.cwi.nl/