Blogs are a dynamic communication medium which has been
widely established on the web. The BlogForever project has
developed an innovative system to harvest, preserve, manage
and reuse blog content. This paper presents a key component
of the BlogForever platform, the web crawler. More
precisely, our work concentrates on techniques to automatically
extract content such as articles, authors, dates and
comments from blog posts. To achieve this goal, we introduce
a simple and robust algorithm to generate extraction
rules based on string matching using the blog’s web feed in
conjunction with blog hypertext. This approach leads to a
scalable blog data extraction process. Furthermore, we show
how we integrate a web browser into the web harvesting process
in order to support data extraction from blogs with
JavaScript-generated content.
BlogForever Crawler: Techniques and algorithms to harvest modern weblogs Presentation at WIMS'14
1. BlogForever Crawler:
Techniques and Algorithms
to Harvest Modern Weblogs
Olivier Blanvillain¹, Nikos Kasioumis², Vangelis Banos³
¹ École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
² European Organization for Nuclear Research (CERN), 1211 Geneva 23, Switzerland
³ Department of Informatics, Aristotle University, Thessaloniki 54124, Greece
2. Contents
• Introduction:
– The disappearing web,
– Blog archiving.
• Our Contributions
• Algorithms
– Motivation,
– Blog content extraction,
– Extraction rules,
– Variations for authors, dates and comments.
• System Architecture
• Evaluation
– Comparison with 3 web article extraction systems.
• Issues and Future Work
3. The disappearing web
Source: http://gigaom.com/2012/09/19/the-disappearing-web-information-decay-is-eating-away-our-history/
4. Blog archiving
1. Why archive the web?
– Web archiving is the process of collecting portions of
the World Wide Web to ensure the information
is preserved in an archive for future researchers, historians,
and the public.
2. Blog archiving is a special case of web archiving.
3. The blogosphere is a live record of contemporary society,
culture, science and economy.
4. Some blogs contain unique data and valuable information.
– Users take action and make important decisions based on
this information.
5. We have a responsibility to preserve the web.
5. BlogForever platform (figure)
[Figure: diagram of the BlogForever platform. Recoverable labels, grouped by stage:]
• Harvesting: blog crawlers, real-time monitoring, HTML data extraction engine, spam filtering.
• Preserving: blog digital repository, digital preservation, quality assurance, collections curation.
• Managing and reusing: public access APIs, personalised services, information retrieval, public web interface (browse, search, export).
• Other labels: unstructured information; original data and XML metadata; web services; web interface.
FP7 EC Funded Project
http://blogforever.eu/
6. Our Contributions
• A web crawler capable of extracting blog articles,
authors, publication dates and comments.
• A new algorithm to build extraction rules from
blog web feeds with linear time complexity,
• Applications of the algorithm to extract authors,
publication dates and comments,
• A new web crawler architecture, including how we
use a complete web browser to render JavaScript
web pages before processing them.
• Extensive evaluation of the content extraction and
execution time of our algorithm against three
state-of-the-art web article extraction algorithms.
7. Motivation
• Extracting metadata and content from HTML
documents is a challenging task.
– Web standards usage is low (<0.5% of websites).
– More than 95% of websites do not pass HTML validation.
• Having blogs as our target websites, we made the
following observations which play a central role in the
extraction process:
a) Blogs provide web feeds: structured and standardized
XML views of the latest posts of a blog,
b) Posts of the same blog share a similar HTML structure.
c) Web feeds usually have 10-20 posts whereas blogs
contain a lot more. We have to access more posts than
the ones referenced in web feeds.
8. Content Extraction Overview
1. Use blog web feeds and referenced HTML pages
as training data to build extraction rules.
2. The extraction rules can locate, in an HTML
page, all RSS-referenced elements, such as:
a) Title,
b) Author,
c) Description,
d) Publication date.
3. Use the defined extraction rules to process all
blog pages.
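As a rough illustration of step 1, the feed side of the training data could be read as follows. This is a minimal sketch, assuming an RSS 2.0 feed whose author is given in `dc:creator` (or `author`); the function name `feed_items` and the sample feed contents (author, date, description) are made up for the example and are not taken from the BlogForever code.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, commonly used by blog feeds for the author field.
DC = "{http://purl.org/dc/elements/1.1/}"

# Illustrative feed; only the post URL and title come from the slides.
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
  <title>Example blog</title>
  <item>
    <title>volumelaser.eim.gr</title>
    <link>http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/</link>
    <dc:creator>Example Author</dc:creator>
    <pubDate>Sun, 09 Mar 2014 10:00:00 +0000</pubDate>
    <description>Short summary.</description>
  </item>
</channel>
</rss>"""


def feed_items(rss_xml):
    """Return one dict per <item>, holding the fields used as training targets."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "link": item.findtext("link"),
            "title": item.findtext("title"),
            "author": item.findtext(DC + "creator") or item.findtext("author"),
            "description": item.findtext("description"),
            "published": item.findtext("pubDate"),
        })
    return items


for entry in feed_items(SAMPLE_FEED):
    print(entry["link"], entry["title"])
```

Each returned dict pairs a post URL with the values that the extraction rules should later locate in the HTML page behind that URL.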
11. Extraction rules and string similarity
• Rules are XPath queries.
• For each rule, we compute a score based on string similarity.
• The choice of ScoreFunction greatly influences the running time
and precision of the extraction process.
• Why we chose the Sørensen–Dice coefficient as the similarity measure:
1. Low sensitivity to word ordering and length variations
2. Runs in linear time
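The coefficient can be sketched in a few lines. The version below is an illustration only: it assumes character bigrams and a simple normalisation (lower-casing, collapsing whitespace), neither of which is specified on the slide, and the function names are hypothetical.

```python
from collections import Counter


def normalise(text):
    """Lower-case and collapse whitespace (an assumed, simple normalisation)."""
    return " ".join(text.lower().split())


def bigrams(text):
    """Multiset of adjacent character pairs in the normalised text."""
    s = normalise(text)
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_similarity(a, b):
    """Sørensen–Dice coefficient: 2 * |X ∩ Y| / (|X| + |Y|) over bigram multisets."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are shorter than two characters: fall back to equality.
        return 1.0 if normalise(a) == normalise(b) else 0.0
    shared = sum((x & y).values())  # multiset intersection
    return 2.0 * shared / total


print(dice_similarity("night", "nacht"))
print(dice_similarity("hello world", "world hello"))
```

Because the score only counts shared bigrams, swapping two words changes just the few bigrams around the word boundary, which is where the low sensitivity to word ordering comes from. Building and intersecting the two multisets takes time proportional to the combined length of the strings.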
12. Example: blog post title best extraction rule
• RSS feed: http://vbanos.gr/en/feed/
• Find the RSS blog post title "volumelaser.eim.gr" in the HTML page:
http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/

XPath                                               HTML element value                      Similarity score
/body/div[@id="page"]/header/h1                     volumelaser.eim.gr                      100%
/body/div[@id="page"]/div[@class="entry-code"]/p/a  http://volumelaser.eim.gr/              80%
/head/title                                         volumelaser.eim.gr | Βαγγέλης Μπάνος    66%
...                                                 ...                                     ...

• The best extraction rule for the blog post title is:
/body/div[@id="page"]/header/h1
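The selection of a best rule can be illustrated with a small, naive sketch that scores every element of a page against the feed title and keeps the highest-scoring path. Several things here are assumptions made for the example: the page is a tiny well-formed snippet parsed with the standard-library XML parser (real blog HTML would need an HTML parser), paths are simplified XPath-like strings, ties go to the deepest node, and the names `dice`, `candidate_rules` and `best_rule` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def dice(a, b):
    """Sørensen–Dice coefficient over character-bigram multisets."""
    def grams(text):
        s = " ".join(text.lower().split())
        return Counter(s[i:i + 2] for i in range(len(s) - 1))
    x, y = grams(a), grams(b)
    total = sum(x.values()) + sum(y.values())
    return 2.0 * sum((x & y).values()) / total if total else 0.0


def candidate_rules(root):
    """Yield (path, depth, text) for every element of the tree."""
    def walk(node, path, depth):
        yield path, depth, "".join(node.itertext()).strip()
        for child in node:
            step = child.tag
            if "id" in child.attrib:
                step += '[@id="%s"]' % child.attrib["id"]
            yield from walk(child, path + "/" + step, depth + 1)
    yield from walk(root, "/" + root.tag, 0)


def best_rule(page, target):
    """Return the (path, depth, text) candidate whose text best matches target."""
    root = ET.fromstring(page)
    # Ties go to the deepest node, i.e. the most specific rule (an assumption).
    return max(candidate_rules(root), key=lambda c: (dice(c[2], target), c[1]))


# Simplified stand-in for the example page above.
PAGE = (
    '<html><head><title>volumelaser.eim.gr | Example</title></head>'
    '<body><div id="page"><header><h1>volumelaser.eim.gr</h1></header>'
    '<div class="entry"><p><a>http://volumelaser.eim.gr/</a></p></div>'
    '</div></body></html>'
)
path, depth, text = best_rule(PAGE, "volumelaser.eim.gr")
print(path, text)
```

This naive version re-reads the text of every subtree, so its cost grows faster than linearly with page size.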
13. Time complexity and linear reformulation
• Post-order traversal of the HTML tree.
• Compute each node's bigrams from its children's bigrams.
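A possible reading of this reformulation is sketched below: a single post-order walk builds each node's bigram multiset by merging the multisets of its children and adding the bigrams that span the boundary between adjacent children. The `Node` class, the restriction of text to leaf nodes and the function name are assumptions for the example, not the authors' data structures.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""                       # only leaves carry text in this sketch
    children: list = field(default_factory=list)


def annotate_bigrams(node, out):
    """Post-order walk storing each node's bigram multiset in out[id(node)].

    Returns (bigrams, first_char, last_char) for the text under `node`.
    """
    if not node.children:
        s = node.text
        grams = Counter(s[i:i + 2] for i in range(len(s) - 1))
        first, last = s[:1], s[-1:]
    else:
        grams, first, last = Counter(), "", ""
        for child in node.children:
            child_grams, child_first, child_last = annotate_bigrams(child, out)
            if not child_first:
                continue                  # empty subtree adds nothing
            grams.update(child_grams)     # reuse the child's result
            if last:
                grams[last + child_first] += 1   # bigram spanning the boundary
            else:
                first = child_first
            last = child_last
    out[id(node)] = grams
    return grams, first, last


TREE = Node("div", children=[
    Node("h1", text="ab"),
    Node("span"),
    Node("p", children=[Node("a", text="c"), Node("b", text="de")]),
])
BIGRAMS = {}
annotate_bigrams(TREE, BIGRAMS)
print(BIGRAMS[id(TREE)])
```

The text of each leaf is scanned once; inner nodes only merge what their children already computed, so no subtree text is read twice.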
14. Variations for authors, dates,
comments
• Authors, dates and comments are special cases as
they appear many times throughout a post.
• To resolve this issue, we implement an extra
component in the Score function:
– For authors: an HTML tree distance between the
evaluated node and the post content node.
– For dates: we check the alternative formats of each
date in addition to the HTML tree distance between
the evaluated node and the post content node.
• Example: “1970-01-01” == “January 1 1970”
– For comments: we use the special comment RSS feed.
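The date check can be sketched as follows. The list of alternative formats is an assumption (the slide only gives the "1970-01-01" and "January 1 1970" pair), English month names are hard-coded to avoid locale effects, and the function names are hypothetical.

```python
import re
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def date_variants(d):
    """Alternative textual renderings of the same date (an assumed, short list)."""
    month = MONTHS[d.month - 1]
    return {
        d.isoformat(),                          # 1970-01-01
        f"{month} {d.day} {d.year}",            # January 1 1970
        f"{month} {d.day}, {d.year}",           # January 1, 1970
        f"{d.day} {month} {d.year}",            # 1 January 1970
        f"{d.day:02d}/{d.month:02d}/{d.year}",  # 01/01/1970
    }


def matches_date(node_text, d):
    """True if the node text contains any variant of d as a whole token run."""
    text = " ".join(node_text.split()).lower()
    for variant in date_variants(d):
        pattern = r"(?<!\w)" + re.escape(variant.lower()) + r"(?!\w)"
        if re.search(pattern, text):
            return True
    return False


print(matches_date("Posted on January 1 1970", date(1970, 1, 1)))
```

A match of this kind would then be combined with the HTML tree distance to the post content node, so that a date in a sidebar or in a comment ranks below the one next to the article.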
16. System Architecture
• Pipeline of operations:
1. Render HTML and JavaScript,
2. Extract content,
3. Extract comments,
4. Download multimedia files,
5. Propagate resulting records to the back-end.
• Interesting areas:
– Blog post page identification,
– Handle blogs with a large number of pages,
– JavaScript rendering,
– Scalability.
17. Blog post identification
• The crawler visits all blog pages.
• For each URL, it needs to identify whether it is
a blog post or not.
• We construct a regular expression based on the
blog post URLs in the RSS feed to identify blog posts.
• We assume that all posts from the same blog
use the same URL pattern.
• This assumption is valid for all blog platforms
we have encountered.
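One way such a regular expression might be derived is sketched below: the post URLs from the feed are split into path segments, and each position is generalised to digits, a literal, or a free-form slug. The generalisation rules and the function name `post_url_regex` are assumptions for the example, and the second feed URL is made up.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def post_url_regex(feed_urls):
    """Build one regex matching URLs shaped like the feed's post URLs."""
    by_shape = defaultdict(list)
    for url in feed_urls:
        parts = urlparse(url)
        segments = [s for s in parts.path.split("/") if s]
        by_shape[(parts.netloc, len(segments))].append(segments)

    alternatives = []
    for (host, _), rows in sorted(by_shape.items()):
        steps = []
        for column in zip(*rows):              # same position across all URLs
            if all(s.isdigit() for s in column):
                steps.append(r"\d+")            # e.g. year, month, day
            elif len(set(column)) == 1:
                steps.append(re.escape(column[0]))  # fixed segment such as "blog"
            else:
                steps.append(r"[^/]+")          # varying slug
        alternatives.append(
            r"https?://" + re.escape(host)
            + "".join("/" + step for step in steps) + "/?"
        )
    return re.compile("^(?:" + "|".join(alternatives) + ")$")


FEED_URLS = [
    "http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/",
    "http://vbanos.gr/blog/2014/02/11/another-post/",   # hypothetical second post
]
POST_RE = post_url_regex(FEED_URLS)
print(POST_RE.pattern)
```

Under the slide's assumption that all posts of a blog share one URL pattern, the crawler can test every discovered URL against this expression before deciding whether to run the extraction rules on it.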
18. Handle blogs with a large number of pages
• Avoid random walk of pages, depth first search or
breadth first search.
• Use a priority queue with machine learning defined
priorities.
• Pages with a lot of blog post URLs have a higher
priority.
• Use a distance-weighted kNN classifier to predict priorities.
– Whenever a new page is downloaded, it is given to the
machine learning system as training data.
– When the crawler encounters a new URL, it will ask the
machine learning system for the potential number of blog
posts and use the value as the download priority of the
URL.
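The interplay between the priority queue and the learner can be sketched as follows. The slides do not say which URL features are used, so `url_features` below is purely hypothetical, and the predictor is a plain inverse-distance-weighted kNN estimate of the number of blog post links rather than the project's actual model.

```python
import heapq
import math
from urllib.parse import urlparse


def url_features(url):
    """Hypothetical features: path depth, digit count, presence of 'page'."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    digits = sum(ch.isdigit() for s in segments for ch in s)
    return (float(len(segments)), float(digits), float("page" in segments))


class DistanceWeightedKNN:
    def __init__(self, k=3):
        self.k = k
        self.samples = []                 # (features, observed post count)

    def train(self, features, n_posts):
        self.samples.append((features, n_posts))

    def predict(self, features):
        if not self.samples:
            return 0.0
        nearest = sorted(
            (math.dist(features, f), y) for f, y in self.samples
        )[: self.k]
        exact = [y for d, y in nearest if d == 0.0]
        if exact:                         # avoid dividing by a zero distance
            return sum(exact) / len(exact)
        weights = [1.0 / d for d, _ in nearest]
        return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


class Frontier:
    """Priority queue of URLs; the highest predicted post count is popped first."""

    def __init__(self, model):
        self.model = model
        self.heap = []
        self.count = 0                    # tie-breaker keeps insertion order

    def push(self, url):
        priority = self.model.predict(url_features(url))
        heapq.heappush(self.heap, (-priority, self.count, url))
        self.count += 1

    def pop(self):
        return heapq.heappop(self.heap)[2]
```

In use, every downloaded page would be passed to `train` with the number of blog post URLs found on it, and every newly discovered URL would go through `push`, so that listing and pagination pages tend to be fetched before pages unlikely to link to posts.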
19. JavaScript rendering
• JavaScript is a widely used client-side language.
• Traditional HTML-based crawlers do not see content that
web pages generate with JavaScript.
• We embed PhantomJS, a headless web browser with
great performance and scripting capabilities.
• We instruct the PhantomJS browser to click dynamic
JavaScript pagination buttons on pages to retrieve
more content (e.g. Disqus Show More button to show
comments).
• This crawler functionality is non-generic and requires
human intervention to maintain and extend to other
cases.
20. Scalability
• When aiming to work with a large amount of
input, it is crucial to build every system layer
with scalability in mind.
• The two core crawler procedures, NewCrawl
and UpdateCrawl, are stateless and purely
functional.
• All shared mutable state is delegated to the
back-end.
21. Evaluation
• Task:
– Extract articles and titles from web pages
• Comparison against three open-source projects:
– Readability (JavaScript),
– Boilerpipe (Java),
– Goose (Scala).
• Criteria:
– Extraction success rate,
– Running time.
• Dataset:
– 2300 blog posts from 230 blogs, obtained from the Spinn3r dataset.
• System:
– Debian Linux 7.2, Intel Core i7-3770 3.4 GHz.
• All data, scripts and instructions to reproduce available at:
– https://github.com/OlivierBlanvillain/blogforever-crawler-publication
22. Evaluation: Extraction success rates
• BlogForever Crawler competitors are generic:
– They do not use RSS feeds.
– They do not use structural similarities between
web pages.
– They can be used with any HTML page.
23. Evaluation: Running time
• Our approach spends the majority of its total running time between
initialisation and the processing of the first blog post.
24. Issues & Future Work
• The main causes of failure were:
– The insufficient quality of web feeds,
– The high structural variation of blog pages in the same
blog.
• Future Work
– Investigate hybrid extraction algorithms. Combine
with other techniques such as word density or spatial
reasoning.
– Large scale deployment of the software in a
distributed architecture.
25. Thank you!
• Contact email: vbanos@gmail.com
• Project code available at:
– https://github.com/BlogForever/crawler
Editor's notes
Our approach uses blog specific characteristics to build extraction rules which are applicable throughout a blog.
One might notice that each best rule computation is independent
and operates on a different input pair. This implies
that Algorithm 1 is embarrassingly parallel: iterations of the
outer loop can trivially be executed on multiple threads.
Functions in Algorithm 1 are intentionally abstract at this
point and will be explained in detail in the remainder of
this section. Subsection 2.3 defines AllRules, Apply and
the ScoreFunction we use for article extraction. In subsection
2.4 we analyse the time complexity of Algorithm 1
and give a linear time reformulation using dynamic programming.
Finally, subsection 2.5 shows how the ScoreFunction
can be adapted to extract authors, dates and comments.