BlogForever Crawler:
Techniques and Algorithms
to Harvest Modern Weblogs
Olivier Blanvillain1, Nikos Kasioumis2, Vangelis Banos3
1Ecole Polytechnique Federale de Lausanne (EPFL) 1015 Lausanne, Switzerland,
2European Organization for Nuclear Research (CERN) 1211 Geneva 23, Switzerland,
3Department of Informatics, Aristotle University, Thessaloniki 54124, Greece
Contents
• Introduction:
– The disappearing web,
– Blog archiving.
• Our Contributions
• Algorithms
– Motivation,
– Blog content extraction,
– Extraction rules,
– Variations for authors, dates and comments.
• System Architecture
• Evaluation
– Comparison with 3 web article extraction systems.
• Issues and Future Work
The disappearing web
Source: http://gigaom.com/2012/09/19/the-disappearing-web-information-decay-is-eating-away-our-history/
Blog archiving
1. Why archive the web?
– Web archiving is the process of collecting portions of
the World Wide Web to ensure the information
is preserved in an archive for future researchers, historians,
and the public.
2. Blog archiving is a special case of web archiving.
3. The blogosphere is a live record of contemporary society,
culture, science and economy.
4. Some blogs contain unique data and valuable information.
– Users take action and make important decisions based on
this information.
5. We have a responsibility to preserve the web.
The BlogForever platform (FP7 EC Funded Project, http://blogforever.eu/)
• Harvesting – blog crawlers:
– Real-time monitoring,
– HTML data extraction engine,
– Spam filtering.
• The crawlers turn unstructured information into original data and
XML metadata, which feed the repository.
• Managing, reusing and preserving – blog digital repository:
– Digital preservation,
– Quality assurance,
– Collections curation,
– Public access APIs,
– Personalised services,
– Information retrieval,
– Public web interface: browse, search, export.
• Access layers: web services and web interface.
Our Contributions
• A web crawler capable of extracting blog articles,
authors, publication dates and comments.
• A new algorithm to build extraction rules from
blog web feeds with linear time complexity,
• Applications of the algorithm to extract authors,
publication dates and comments,
• A new web crawler architecture, including how we
use a complete web browser to render JavaScript
web pages before processing them.
• Extensive evaluation of the content extraction and
execution time of our algorithm against three
state-of-the-art web article extraction algorithms.
Motivation
• Extracting metadata and content from HTML
documents is a challenging task.
– Web standards usage is low (<0.5% of websites).
– More than 95% of websites do not pass HTML validation.
• Having blogs as our target websites, we made the
following observations which play a central role in the
extraction process:
a) Blogs provide web feeds: structured and standardized
XML views of the latest posts of a blog,
b) Posts of the same blog share a similar HTML structure.
c) Web feeds usually have 10-20 posts whereas blogs
contain a lot more. We have to access more posts than
the ones referenced in web feeds.
Content Extraction Overview
1. Use blog web feeds and referenced HTML pages
as training data to build extraction rules.
2. Extraction rules are capable of locating all RSS-referenced
elements in the HTML page, such as:
1. Title,
2. Author,
3. Description,
4. Publication date,
3. Use the defined extraction rules to process all
blog pages.
Locate all RSS-referenced elements in the HTML page
Generic procedure to build extraction rules
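The procedure pairs each web feed entry with the HTML page it references, generates every candidate XPath rule for that page, scores each rule's output against the feed value, and keeps the highest-scoring rule. Below is a minimal Python sketch of this idea; the helper names echo the AllRules, Apply and ScoreFunction of the paper's Algorithm 1 (see the speaker notes at the end), but the implementation details are our own illustration, assuming lxml.

```python
from collections import defaultdict
from lxml import html

def all_rules(tree):
    """Candidate rules: one absolute XPath per element of the page."""
    root = tree.getroottree()
    return [root.getpath(el) for el in tree.iter()
            if isinstance(el.tag, str)]        # skip comments and PIs

def apply_rule(tree, rule):
    """Text content selected by the rule, or '' if nothing matches."""
    nodes = tree.xpath(rule)
    return nodes[0].text_content().strip() if nodes else ""

def best_rule(training_pairs, score_function):
    """training_pairs: (feed_value, page_html) tuples for one field,
    e.g. each feed entry's title and its referenced HTML page.
    Returns the rule with the highest total similarity score."""
    totals = defaultdict(float)
    for feed_value, page_html in training_pairs:
        tree = html.fromstring(page_html)
        for rule in all_rules(tree):
            totals[rule] += score_function(feed_value, apply_rule(tree, rule))
    return max(totals, key=totals.get)
```

Because posts of the same blog share a similar HTML structure (observation b above), the same XPath recurs across training pairs and accumulates score; and since each pair is processed independently, the outer loop is embarrassingly parallel.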
Extraction rules and string similarity
• Rules are XPath queries.
• For each rule, we compute the score based on string similarity.
• The choice of ScoreFunction greatly influences the running time
and precision of the extraction process.
• Why we chose Sørensen–Dice coefficient similarity:
1. Low sensitivity to word ordering and length variations
2. Runs in linear time
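For reference, a minimal Python implementation of the Sørensen–Dice coefficient over character bigrams (the paper's exact tokenisation may differ):

```python
from collections import Counter

def dice_similarity(a: str, b: str) -> float:
    """Dice(A, B) = 2|A ∩ B| / (|A| + |B|) over bigram multisets.
    Runs in time linear in len(a) + len(b)."""
    x = Counter(a[i:i + 2] for i in range(len(a) - 1))
    y = Counter(b[i:i + 2] for i in range(len(b) - 1))
    overlap = sum((x & y).values())            # multiset intersection
    total = sum(x.values()) + sum(y.values())
    return 2.0 * overlap / total if total else 1.0
```

Counting bigrams as a multiset makes repeated bigrams weigh correctly, and ignoring bigram order gives the low sensitivity to word ordering noted above.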
Example: blog post title best extraction rule
• RSS feed: http://vbanos.gr/en/feed/
• Find the RSS blog post title “volumelaser.eim.gr” in the HTML page:
http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/
XPath → HTML element value (similarity score):
• /body/div[@id="page"]/header/h1 → volumelaser.eim.gr (100%)
• /body/div[@id="page"]/div[@class="entry-code"]/p/a → http://volumelaser.eim.gr/ (80%)
• /head/title → volumelaser.eim.gr | Βαγγέλης Μπάνος (66%)
• ...
• The best extraction rule for the blog post title is:
/body/div[@id="page"]/header/h1
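Applying the winning rule to the blog's remaining pages is then a single XPath query; for example with lxml (our illustration):

```python
from lxml import html

def extract_title(page_source):
    tree = html.fromstring(page_source)
    # The slide writes rules starting at <body>; lxml absolute paths
    # start at the <html> root, hence the /html prefix here.
    nodes = tree.xpath('/html/body/div[@id="page"]/header/h1')
    return nodes[0].text_content().strip() if nodes else None
```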
Time complexity and linear reformulation
• Post-order traversal of the HTML tree.
• Compute node bigrams from their children's bigrams.
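A naive scoring pass re-extracts the text of every candidate node, re-scanning shared content many times; computing each node's bigram multiset from its children's already-computed multisets in a single post-order traversal visits every text fragment once. A simplified Python sketch of that bottom-up step (lxml/ElementTree-style nodes assumed; a truly linear version would avoid copying the child counters):

```python
from collections import Counter

def bigrams(s):
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def collect_bigrams(node, memo):
    """Post-order traversal: a node's bigrams are its own text's
    bigrams plus its children's (already computed) bigrams."""
    total = bigrams(node.text or "")
    for child in node:                          # element children
        total += collect_bigrams(child, memo)   # children finish first
        total += bigrams(child.tail or "")      # text after the child
    memo[node] = total
    return total
```

With memo filled, the Dice score of every candidate node against the RSS value can be computed without re-reading the document.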
Variations for authors, dates and comments
• Authors, dates and comments are special cases as
they appear many times throughout a post.
• To resolve this issue, we implement an extra
component in the Score function:
– For authors: an HTML tree distance between the
evaluated node and the post content node.
– For dates: we check the alternative formats of each
date in addition to the HTML tree distance between
the evaluated node and the post content node.
• Example: “1970-01-01” == “January 1 1970”
– For comments: we use the special comment RSS feed.
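For the date check, one way to treat alternative formats as equal is to normalise both strings before comparing; a sketch using python-dateutil (the paper does not name a specific library):

```python
from dateutil import parser  # pip install python-dateutil

def same_date(a: str, b: str) -> bool:
    """Compare two date strings regardless of their surface format."""
    try:
        return parser.parse(a) == parser.parse(b)
    except (ValueError, OverflowError):
        return False

assert same_date("1970-01-01", "January 1 1970")   # the slide's example
```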
System Architecture
• Our crawler is built on top of Scrapy (http://www.scrapy.org/)
System Architecture
• Pipeline of operations:
1. Render HTML and JavaScript,
2. Extract content,
3. Extract comments,
4. Download multimedia files,
5. Propagate resulting records to the back-end.
• Interesting areas:
– Blog post page identification,
– Handling blogs with a large number of pages,
– JavaScript rendering,
– Scalability.
Blog post identification
• The crawler visits all blog pages.
• For each URL, it needs to identify whether it is
a blog post or not.
• We construct a regular expression from the blog post URLs
in the RSS feed to identify blog posts.
• We assume that all posts from the same blog
use the same URL pattern.
• This assumption is valid for all blog platforms
we have encountered.
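One simple way to derive such a pattern, given the post URLs listed in the feed (our illustration, not necessarily the paper's exact construction): keep path segments that are identical across samples, and generalise the segments that vary.

```python
import re

def build_post_url_regex(sample_urls):
    """Generalise the feed's post URLs into one regex; assumes all
    posts of the blog share a single URL pattern (as stated above)."""
    split = [u.rstrip("/").split("/") for u in sample_urls]
    parts = []
    for segments in zip(*split):
        if all(s == segments[0] for s in segments):
            parts.append(re.escape(segments[0]))   # constant segment
        elif all(s.isdigit() for s in segments):
            parts.append(r"\d+")                   # numeric (year, month, day)
        else:
            parts.append(r"[^/]+")                 # variable slug
    return re.compile("/".join(parts) + "/?$")

rx = build_post_url_regex([
    "http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/",
    "http://vbanos.gr/blog/2014/02/11/another-post/",   # hypothetical URL
])
assert rx.match("http://vbanos.gr/blog/2014/05/01/new-post/")  # a post page
```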
Handling blogs with a large number of pages
• Avoid a random walk of the pages, depth-first search or
breadth-first search.
• Use a priority queue with priorities assigned by machine
learning.
• Pages with a lot of blog post URLs have a higher
priority.
• Use a Distance-Weighted kNN classifier for the prediction:
– Whenever a new page is downloaded, it is given to the
machine learning system as training data.
– When the crawler encounters a new URL, it will ask the
machine learning system for the potential number of blog
posts and use the value as the download priority of the
URL.
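A minimal sketch of the distance-weighted kNN prediction (the numeric page features, e.g. URL depth or link counts, are our illustration; the paper does not spell them out):

```python
import math

class DistanceWeightedKNN:
    """Predict a page's blog-post count from pages crawled so far."""

    def __init__(self, k=5):
        self.k = k
        self.samples = []            # (feature_vector, post_count) pairs

    def train(self, features, post_count):
        self.samples.append((features, post_count))

    def predict(self, features):
        if not self.samples:
            return 0.0
        nearest = sorted(self.samples,
                         key=lambda s: math.dist(s[0], features))[:self.k]
        weights = [1.0 / (math.dist(f, features) + 1e-9) for f, _ in nearest]
        counts = [c for _, c in nearest]
        return sum(w * c for w, c in zip(weights, counts)) / sum(weights)
```

The predicted count is then used directly as the URL's priority in the download queue.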
JavaScript rendering
• JavaScript is a widely used client-side language.
• Traditional HTML-based crawlers do not see content that
web pages generate with JavaScript.
• We embed PhantomJS, a headless web browser with
great performance and scripting capabilities.
• We instruct the PhantomJS browser to click dynamic
JavaScript pagination buttons on pages to retrieve
more content (e.g. Disqus Show More button to show
comments).
• This crawler functionality is non-generic and requires
human intervention to maintain and extend to other
cases.
Scalability
• When aiming to work with a large amount of
input, it is crucial to build every system layer
with scalability in mind.
• The two core crawler procedures, NewCrawl
and UpdateCrawl, are stateless and purely
functional.
• All shared mutable state is delegated to the
back-end.
Evaluation
• Task:
– Extract articles and titles from web pages
• Comparison against three open-source projects:
– Readability (Javascript),
– Boilerpipe (Java),
– Goose (Scala).
• Criteria:
– Extraction success rate,
– Running time.
• Dataset:
– 2,300 blog posts from 230 blogs obtained from the Spinn3r dataset.
• System:
– Debian Linux 7.2, Intel Core i7-3770 3.4 GHz.
• All data, scripts and instructions needed to reproduce the evaluation are available at:
– https://github.com/OlivierBlanvillain/blogforever-crawler-publication
Evaluation: Extraction success rates
• The BlogForever Crawler's competitors are generic:
– They do not use RSS feeds.
– They do not use structural similarities between
web pages.
– They can be used with any HTML page.
Evaluation: Running time
• Our approach spends the majority of its total running time between
the initialisation and the processing of the first blog post.
Issues & Future Work
• Our main causes of failure were:
– The insufficient quality of web feeds,
– The high structural variation of blog pages in the same
blog.
• Future Work
– Investigate hybrid extraction algorithms: combine our
approach with other techniques such as word density or
spatial reasoning.
– Large scale deployment of the software in a
distributed architecture.
Thank you!
BlogForever Crawler: Techniques and Algorithms
to Harvest Modern Weblogs
Olivier Blanvillain1, Nikos Kasioumis2, Vangelis Banos3
1Ecole Polytechnique Federale de Lausanne (EPFL) 1015 Lausanne, Switzerland,
2European Organization for Nuclear Research (CERN) 1211 Geneva 23, Switzerland,
3Department of Informatics, Aristotle University, Thessaloniki 54124, Greece
• Contact email: vbanos@gmail.com
• Project code available at:
– https://github.com/BlogForever/crawler
Speaker notes
  1. Our approach uses blog-specific characteristics to build extraction rules which are applicable throughout a blog.
  2. One might notice that each best rule computation is independent and operates on a different input pair. This implies that Algorithm 1 is embarrassingly parallel: iterations of the outer loop can trivially be executed on multiple threads. Functions in Algorithm 1 are intentionally abstract at this point and are explained in detail in the remainder of this section. Subsection 2.3 defines AllRules, Apply and the ScoreFunction we use for article extraction. In subsection 2.4 we analyse the time complexity of Algorithm 1 and give a linear time reformulation using dynamic programming. Finally, subsection 2.5 shows how the ScoreFunction can be adapted to extract authors, dates and comments.