This document discusses techniques for discovering structured information from web sites. It presents three main contributions:
1. A method to extract structured data in the form of web lists that are split across multiple web pages, called logical lists.
2. An approach for automatically extracting sitemaps from web sites.
3. A technique for clustering web pages based on intra-page and extra-page features.
1. Discovering Structured Information from Web Sites
PhD Candidate: Fabiana Lanotte
Supervisor: Michelangelo Ceci
UNIBA: http://www.uniba.it – DIB: http://www.di.uniba.it – KDDE: http://kdde.di.uniba.it
2. The Web... from a homogeneous information network
● Homogeneous nodes (i.e. hypertextual documents) and relations (i.e. hyperlinks)
● Node content is encoded in HTML
● Node content is unstructured
3. The Web... from a homogeneous information network
HTML is a markup language:
● it is a description for a browser to render;
● it describes how the data should be displayed;
● it was never meant to describe the data.
4. The Web... to a heterogeneous information network
However, the Web is full of data that:
● are structured in some way;
● represent different kinds of real-world entities;
● interact via various kinds of relationships.
[Diagram: a website graph with entity nodes such as staff, professor, course, news, and papers, connected by relationships such as "provided course".]
7. What could we do?
Search
● Show structured information in response to a query
● Automatically rank and cluster web pages
● Reason over the Web
○ Who are the people at some company? What are the courses in some college department?
Analysis
● Expand the known information about an entity
○ What is a professor’s phone number, email, courses taught, research, etc.?
10. Contributions of this thesis
1. Extraction of structured data in the form of web lists split across multiple web pages: logical lists
2. Automatic extraction of sitemaps
3. Clustering of web pages based on their intra-page and extra-page features
11. Automatic Extraction of Logical Web Lists
Pasqua Fabiana Lanotte, F. Fumarola, M. Ceci, D. Malerba. Automatic Extraction of Logical Web Lists. ISMIS 2014: 365-374
12. Introduction
A large amount of structured data on the Web exists in several forms:
– HTML lists, tables, and back-end Deep Web databases
13. Table tags, list tags and much more
● Cafarella et al. [1] estimated that there are more than one billion items of relational data on the Web expressed using the HTML table tag;
● Elmeleegy et al. [2] suggested an equal number coming from HTML lists;
… but much structured data is not represented with table or list tags (e.g. BBC news).
[1] M. J. Cafarella, A. Halevy, and J. Madhavan. Structured data on the web. Commun. ACM, Feb. 2011
[2] H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting relational tables from lists on the web. VLDB Journal, Apr. 2011
14. Web Pages as Database Views
• Moreover, many websites (e.g. Amazon, Trulia, AbeBooks, etc.) present their listings across multiple web pages;
• This is similar to a database:
– the whole listing can represent a database table (i.e. a logical list);
– each list contained in a web page can represent a view of a logical list.
15. Web Pages as Database Views
• Logical list: the listing of all computer products;
• View list: the top six products;
• There are web lists that
– allow us to extend a view list,
– filter a logical list.
16. Goal
Our goal is to define a novel unsupervised and domain-independent approach that is able to:
• identify web lists in a web page, even when they are not represented as HTML lists or tables;
• extract logical lists.
The approach can be used in several application domains (entity page discovery, query answering, etc.).
17. Related Works
• Existing approaches for web list extraction can be classified into:
– Structural methods: focus on automatically extracting rules based on common DOM structures; they often fail to handle more complicated or noisy structures (e.g. RoadRunner);
– Visual methods: use visual information from rendered web pages (e.g. ViDE);
– Hybrid methods: combine structural and visual features (e.g. HyLiEn, Ventex).
18. Related Works
• Existing approaches focus only on the extraction of web lists from a single web page, and fail to detect logical lists.
• Furche et al. presented a supervised method, based on structural and visual features, to extract pagination lists from web pages:
– the accuracy of the model depends strongly on the size of the training set and on the complexity of the feature model used;
– it is not applicable at Web scale.
19. Definitions
A web page is characterized by multiple representations:
• Textual: composed of the web page’s terms;
• Structural: composed of HTML tags;
• Visual: composed of rendering information, captured by the Rendered Box Tree, where each rendered box carries its HTML tag and its coordinates (x, y, h, w).
23. Definitions
Web List:
• a collection of two or more web elements (called data records), codified as rendered boxes, that have a similar HTML structure and are visually adjacent and aligned.
25. Methodology
A three-step strategy:
1. Web list extraction: given a web page P, extract the set L = {l1, l2, …, ln} of web lists contained in P;
2. Dominant list identification
– Dominant list: the list of interest, containing the data records of the logical list we want to extract;
3. Logical list discovery.
26. Methodology
2. Dominant list identification
The set L = {l1, l2, …, ln} is used to extract the dominant list li.
27. Methodology
3. Logical list discovery
• The dominant list li is used to discover the logical list LL containing li.
• Idea: links that are grouped in collections with a uniform layout and presentation usually lead to similar pages.
• Two lists belong to the same logical list LL if their elements satisfy a structural similarity (e.g. Levenshtein distance, normalized tree edit distance) and a visual similarity; a sketch of the structural test follows.
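A minimal sketch of the structural-similarity test, in Python. It assumes each list element's DOM subtree has been flattened into a sequence of tag names; the flattening and the 0.6 threshold are illustrative assumptions, not the exact parameters of the thesis.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def structurally_similar(tags_a, tags_b, threshold=0.6):
    # tags_a, tags_b: tag-name sequences flattened from two data records,
    # e.g. ['div', 'a', 'img', 'span']. The records are structurally similar
    # when the normalized edit distance between the sequences is small.
    dist = levenshtein(tags_a, tags_b) / max(len(tags_a), len(tags_b), 1)
    return 1.0 - dist >= threshold

print(structurally_similar(['div', 'a', 'img', 'span'],
                           ['div', 'a', 'img', 'b']))  # -> True (3/4 match)
```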
28. Experiments
• Dataset: 66,061 list elements manually extracted from 4,405 web pages belonging to 40 different websites (music shops, web journals, movie information, home listings, computer accessories, etc.);
• For the deep-web databases, we performed a query on each of them and collected the first page of the result list; for the others we manually selected a web page;
• Measures used: Precision, Recall, F-measure.
29. Results
– On average, the method achieves 100% Precision, 95% Recall and 97% F-Measure;
– There are no false positives;
– The false negatives are caused by the high variance of the data records.
30. Conclusions and Future Works
• Our method solves the problem of
– discovering data records organized in web lists,
– extracting logical lists.
• The experimental results show that our algorithm is able to discover logical lists in a heterogeneous set of websites (i.e. it is domain independent);
• Logical list extraction can be used to improve the extraction of entity pages (e.g. web pages of books, professors, etc.), to improve query answering using lists, and for disambiguation.
31. Automatic Generation of Sitemaps
Pasqua Fabiana Lanotte, F. Fumarola, D. Malerba, M. Ceci. Automatic Generation of Sitemaps Based on Navigation Systems. MOD 2016: 216-223
Pasqua Fabiana Lanotte, M. Ceci. Closed Sequential Pattern Mining for Sitemap Generation. Ready to be submitted to TKDE
32. Automatic Generation of Sitemaps
• CloFAST: Closed Sequential Pattern Mining using Sparse and Vertical Id-Lists
• Generation of sitemaps through navigation systems
33. CloFAST: Closed Sequential Pattern Mining using Sparse and Vertical Id-Lists
F. Fumarola, Pasqua Fabiana Lanotte, M. Ceci, D. Malerba. CloFAST: closed sequential pattern mining using sparse and vertical id-lists. Knowl. Inf. Syst. 48(2): 429-463 (2016)
34. Sequential Pattern Mining
• Sequential pattern mining provides approaches and techniques to mine knowledge from sequence data.
• Given a sequence database and a support threshold, find the complete list of frequent subsequences.
• Example: a customer who buys a Canon camera is likely to buy an HP color printer within a month.
Given the support threshold minSupp = 2, <(ab)c> is a frequent sequential pattern (see the sketch below).
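To make the counting concrete, here is a minimal Python sketch of subsequence matching and support over a toy database (the database is a hypothetical example, not the slide's original one):

```python
def is_subsequence(pattern, sequence):
    # pattern, sequence: ordered lists of itemsets (Python sets). The pattern
    # matches if each of its itemsets is contained in some itemset of the
    # sequence, respecting order.
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= itemset:
            i += 1
    return i == len(pattern)

def support(pattern, database):
    # sigma(pattern): number of sequences in the database containing it.
    return sum(is_subsequence(pattern, seq) for seq in database)

db = [
    [{'a', 'b'}, {'c'}],         # <(ab) c>
    [{'a', 'b'}, {'d'}, {'c'}],  # <(ab) d c>
    [{'a'}, {'c'}],              # <a c>
]
print(support([{'a', 'b'}, {'c'}], db))  # -> 2, so <(ab)c> is frequent at minSupp = 2
```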
35. State of the Art: Limitations
• There are different types of patterns: frequent, closed, maximal, with constraints, etc.
• However, the main limits of the state-of-the-art algorithms are:
1. the need for multiple scans of the dataset;
2. the generation of a huge set of candidate sequences;
3. inefficiency in handling very long sequences;
4. (all resulting in) inefficiency in handling massive data.
• Our solution is the CloFAST algorithm for closed sequential pattern mining.
36. Problem Definition
Let I = {i1, i2, …, in} be a set of distinct items:
• An itemset is a non-empty subset of items, denoted as (i1, i2, …, ik).
• A sequence is an ordered list of elements, denoted as ‹s1 → s2 → … → sm›, where each si can be an itemset or a single item.
• A sequence α = ‹a1 → a2 → … → an› is called a sub-sequence of another sequence β = ‹b1 → b2 → … → bm›, and β a super-sequence of α, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn.
Example
• The sequence A → BC is a sub-sequence of the sequence AB → E → BCD;
• AC → AD is not a sub-sequence of A → C → AD.
37. Problem Definition
• A sequence database D is a set of tuples (sid, S), where sid is a sequence id and S is a sequence.
• The support of a sequence α in D, denoted as σ(α), is the number of tuples containing α.
• Given a user-defined threshold δ, a sequence α is said to be frequent in D if σ(α) ≥ δ.
• Given two sequences β and α, if β is a super-sequence of α and their supports are equal, we say that β absorbs α; a frequent sequence is closed if no super-sequence absorbs it (see the sketch below).
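Building on the support() sketch above, the absorption test can be written directly (a naive check for illustration only; CloFAST never enumerates super-sequences this way):

```python
def absorbs(beta, alpha, database):
    # beta absorbs alpha when beta is a super-sequence of alpha
    # and both have the same support.
    return (is_subsequence(alpha, beta)
            and support(alpha, database) == support(beta, database))

# In the toy db above, <(ab)> and <(ab)c> both have support 2,
# so <(ab)c> absorbs <(ab)> and <(ab)> is not closed.
print(absorbs([{'a', 'b'}, {'c'}], [{'a', 'b'}], db))  # -> True
```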
38. CloFAST
• CloFAST combines:
– two data structures to support search-space exploration for closed itemsets and sequences;
– a new representation of sequence databases based on sparse id-lists and vertical id-lists;
– a novel one-step technique to both check backward closure and prune the search space.
• CloFAST works in two steps:
– closed itemset mining using sparse id-lists;
– closed sequence discovery using vertical id-lists.
39. Closed Itemset Enumeration Tree (CIET)
• It enumerates the complete set of closed frequent itemsets:
– the root node is the empty set;
– the first level enumerates the 1-length itemsets according to a lexicographic order;
– the other levels store the itemsets of length > 1.
• Each node in the CIET can be labeled as intermediate, unpromising, or closed.
40. CIET: Example
• Nodes with heavy-line borders represent closed itemsets.
• Nodes with dashed-line borders represent unpromising nodes.
• The remaining nodes represent intermediate nodes.
41. Closed Sequence Enumeration Tree (CSET)
• It enumerates the complete search space of closed sequences.
• It has the following properties:
1. each node corresponds to a sequence;
2. if a node corresponds to a sequence s, its children are obtained by a sequence extension of s.
42. CSET: Example
• Nodes with continuous-line borders represent closed sequences.
• Nodes with dashed-line borders represent pruned nodes.
43. Sparse Id-List
Given an itemset t, its sparse id-list (SILt) is a vector of size n (with |D| = n) such that, for each j = 1, …, n:
– SILt[j] is the ordered list of the transaction ids at which t occurs in Sj, if Sj contains t;
– SILt[j] = null otherwise.
• It can be efficiently used to support itemset extension.
44. Sparse Id-List: Example

Sequence_ID | Sequence (transaction ids 1–6)
1 | a (abc) (ac) d (cf)
2 | (ad) c (bc) (ae)
3 | (ef) (ab) (df) c b
4 | e g (af) c b c

[Figure: the sparse id-lists SILa, SILb, SILc, SILd, SILe, SILf derived from this database.]
For instance, SILa = [{1, 2, 3}, {1, 4}, {2}, {3}]: item a occurs at transactions 1, 2 and 3 of sequence 1, at transactions 1 and 4 of sequence 2, at transaction 2 of sequence 3, and at transaction 3 of sequence 4.
45. Itemset Extension using SILs
• Itemset extension (I-Step): an item is added to the last transaction of a sequence.
• In CloFAST, itemset extensions are executed during the construction of the CIET.
• The I-Step is achieved by iterating over the elements of SILa and SILb and returning a new SIL(a,b) containing, for each sequence, the transaction ids found in both SILs; see the sketch below.
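A minimal sketch of the I-Step under the representation above, assuming each SIL is a Python list with one entry per sequence, each entry being a set of transaction ids or None:

```python
def istep(sil_a, sil_b):
    # Intersect two sparse id-lists position-wise: the extended itemset (a, b)
    # occurs in a transaction only if both a and b occur in that transaction.
    result = []
    for ids_a, ids_b in zip(sil_a, sil_b):
        common = (ids_a & ids_b) if ids_a and ids_b else None
        result.append(common or None)  # an empty intersection becomes None
    return result

# Id-lists read off the example database on the previous slide (sequences 1-4).
sil_a = [{1, 2, 3}, {1, 4}, {2}, {3}]
sil_b = [{2}, {3}, {2, 5}, {5}]
print(istep(sil_a, sil_b))  # SIL(ab) -> [{2}, None, {2}, None]
```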
46. Vertical Id-List
Given a sequence α whose last itemset is i, its vertical id-list (VILα) is a vector of size n (with |D| = n) such that, for each j = 1, …, n:
– VILα[j] is the transaction id at which the last itemset i appears in the first occurrence of α in Sj, if Sj contains α;
– VILα[j] = null otherwise.
It can be efficiently used to support sequence extension.
47. Vertical Id-List: Example
[Figure: the vertical id-lists VILa, VILb, VILc, VILd, VILe, VILf built from the sparse id-lists of the example database above.]
For instance, VILa = [1, 1, 2, 3]: the single-item sequence ‹a› first occurs at transaction 1 in sequences 1 and 2, at transaction 2 in sequence 3, and at transaction 3 in sequence 4.
48. Algorithm
1. With a first database scan, CloFAST finds the frequent 1-length itemsets and builds their sparse id-lists.
2. It mines closed frequent itemsets, using the CIET to explore the search space and itemset extension to verify the support of the discovered itemsets.
3. The closed frequent itemsets are used to fill the first level of a CSET.
4. It discovers closed sequential patterns by performing sequence extension on candidates generated from the CSET. For each candidate:
– its VIL and its support are computed (see the sketch below);
– it is evaluated for sequence closure and pruning.
5. Finally, the list of closed sequential patterns is returned.
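One detail worth making concrete: under this representation the support of a candidate falls out of its VIL for free, as the number of non-null entries (a sketch over the list-based VILs used above):

```python
def support_from_vil(vil):
    # sigma(alpha) = number of sequences in which alpha occurs,
    # i.e. the count of non-null entries in its vertical id-list.
    return sum(entry is not None for entry in vil)

print(support_from_vil([1, 1, 2, 3]))        # VIL of <a>    -> 4
print(support_from_vil([2, None, 2, None]))  # VIL of <(ab)> -> 2
```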
49. Backward Closure and Pruning
• It is inspired by the bidirectional checking of BIDE1.
• The intuition is that it is useless to further explore a node if this node and its descendants can be absorbed by a node stored in another path of the tree.
• It is divided into:
– Forward closure: performed by exploring the search space during the S-Step;
– Backward closure: performed on the CSET by alternating itemset-closure and sequence-closure checking for nodes that are not closed in the forward direction.
• The pruning allows us not to explore nodes that span the same search space (their VILs are used as markers).
1. Jianyong Wang, Jiawei Han, Chun Li. Frequent Closed Sequence Mining without Candidate Maintenance. IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 8, pp. 1042-1056, Aug. 2007
50. Backward Closure and Pruning: Example
• Example of itemset closure for the sequence α = <a → d>;
• α isn't explored further, since β has the same VIL.
51. Experiments
• Datasets: both synthetic and real
• Competitors: CloSpan, BIDE, ClaSP, FAST
• Evaluation:
– efficiency (varying dataset density)
– scalability (varying dataset size)
– effectiveness of CloFAST's optimization technique
54. Results
• The performance study shows that CloFAST significantly outperforms the state-of-the-art algorithms;
• This is more evident in the case of:
– dense datasets;
– small values of the support threshold;
– datasets with frequent long sequences.
56. Sitemaps
• A sitemap represents an explicit specification of the design concept and knowledge organization of a website.
• It helps users and search bots by providing a hierarchical overview of a website's content:
– it increases the user experience of the website;
– it provides a tool complementary to keyword-based search in the information retrieval process.
“One of the oldest hypertext usability principles is to offer a visual representation of the information space in order to help users understand where they can go.”
Jakob Nielsen
57. State of the Art
• Before the Google Sitemap Protocol (2005), sitemaps were mostly manually generated:
– as websites got bigger, it was hard to keep sitemaps updated, and content was missed (e.g. blogs, forums);
– most existing tools extract a flat list of URLs and do not output the hierarchical structure of websites.1
• It is not a simple process, especially for websites with a great amount of content.
1. http://slickplan.com/, http://www.screamingfrog.co.uk, https://www.xml-sitemaps.com/
58. State of the Art
• The most prominent works are mainly based on the textual content of the web pages.
• HDTM (Weninger et al., CIKM 2012):
1. uses random walks with restart from the homepage to sample a distribution to assign to each of the website's pages;
2. an iterative algorithm based on Gibbs sampling discovers hierarchies; the final hierarchy is obtained by selecting, for each node, the most probable parent.
– Limitations: ineffective in at least two cases: 1) when there is not enough information in the text of a page; 2) when web pages have different content but actually refer to the same semantic class.
59. State of the Art
• Other works are based on the analysis of hyperlinks, URL structure, and heuristics (or a combination of these features).
• Limitations:
– they rely on supervised solutions;
– they assume websites are homogeneous;
– they are unable to recover the original hierarchical organization codified in embedded navigation systems.
60. Automatic Generation of Sitemaps
• In this work we present a solution to extract deeper sitemaps by using the website's navigation system:
– it provides a local view of the website hierarchy.
• Idea: combine these local views to extract sitemaps that describe the global view of the website hierarchy.
61. Proposed Solution
We develop an unsupervised and domain-independent method which is able to discover the hierarchical structure of a website using frequent navigation paths through web lists.
62. Sitemap Definition
Given:
• a web graph G(V, E) rooted at the homepage h ∈ V;
• a sequence database which enumerates a subset of all possible paths in G starting from h and having length t ∈ ℕ;
• a threshold k ∈ ℕ;
a sitemap is a tree T(V’, E’) ⊆ G such that:
• V’ ⊆ V, E’ ⊆ E;
• ∀ (i, j) ∈ E’: j ∈ weblist(i);
• ∀ e = (i, j) ∈ E’ there is a weight w: E → ℕ such that, for α = pathT(‹h, …, i, j›), w(e) = σ(α) and w(e) > k.
A toy construction is sketched below.
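A toy sketch of how a sitemap tree could be assembled once the frequent root paths and their supports are available (the path data and the keep-the-best-parent conflict rule are illustrative assumptions, not the thesis's exact algorithm):

```python
def build_sitemap(frequent_paths, k):
    # frequent_paths: dict mapping a root path (a tuple of pages starting at
    # the homepage) to its support sigma, e.g. {('home', 'research'): 41, ...}.
    # Keep edge (i, j) when the path <h, ..., i, j> has support above k.
    best = {}
    for path, sigma in frequent_paths.items():
        if len(path) >= 2 and sigma > k:
            i, j = path[-2], path[-1]
            # Keep one parent per node (the best-supported one) so the
            # result is a tree rather than a general graph.
            if j not in best or sigma > best[j][1]:
                best[j] = (i, sigma)
    return {child: parent for child, (parent, _) in best.items()}

paths = {('home', 'research'): 41, ('home', 'staff'): 38,
         ('home', 'research', 'papers'): 17, ('home', 'papers'): 9}
print(build_sitemap(paths, k=10))
# -> {'research': 'home', 'staff': 'home', 'papers': 'research'}
```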
63. Methodology
The algorithm employs a three-step strategy:
1. Sequence dataset generation (using random walks; a toy sketch of this step follows);
2. Sequential pattern mining (CloFAST);
3. Post-pruning.
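A toy sketch of the sequence-dataset generation, assuming a hypothetical adjacency dictionary restricted to navigation-system links (the walk length and count are illustrative parameters):

```python
import random

def random_walks(graph, home, num_walks=1000, max_len=6, seed=7):
    # graph: adjacency dict {page: [linked pages...]}; every walk starts at
    # the homepage, mirroring the role of h in the sitemap definition.
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        walk = [home]
        while len(walk) < max_len and graph.get(walk[-1]):
            walk.append(rng.choice(graph[walk[-1]]))
        walks.append(walk)
    return walks

g = {'home': ['research', 'staff'], 'research': ['papers', 'projects']}
print(random_walks(g, 'home', num_walks=2))
# e.g. -> [['home', 'research', 'papers'], ['home', 'staff']]
```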
71. Results
• Models based on the sequential pattern mining approach outperform the competitor HDTM.
• Comparisons are based on the websites' published sitemaps, which are poor in content; this implies that the generated models have high precision but low recall.
72. Exploiting Websites Structural and Content Features for Web Pages Clustering
Pasqua Fabiana Lanotte, F. Fumarola, D. Malerba, M. Ceci. Exploiting Web Sites Structural and Content Features for Web Pages Clustering. ISMIS 2017
73. Introduction
• The process of automatically organizing web pages and websites has always been an important task in Web Mining.
• Since a web page is characterized by several representations, the existing clustering algorithms differ in their ability to use these representations.
• Most existing works analyze these features almost independently, mainly because different sources of information use different data representations.
74. Related Works
• Clustering based on the textual representation:
– considers web pages as plain texts;
– turns out to be ineffective when there is not enough information in the text, or when pages have different content but actually refer to the same semantic class.
75. Related Works
• Clustering based on the HTML structure representation:
– typically considers the HTML formatting (i.e. HTML tags and the visual information rendered by a web browser);
– idea: similar web pages are generated using a common HTML template.
76. Related Works
• Clustering based on the hyperlink structure:
– web pages are nodes in a graph where hyperlinks enable the information to be split across multiple, interdependent nodes;
– clustering is based on the analysis of the topological structure of the network.
77. Goal
Perform clustering of the web pages in a website by combining information about the content, page structure, and hyperlink structure of the web pages.
Idea: two web pages are similar if they have common terms (i.e. the bag-of-words hypothesis) and they share the same reachability properties in the website's graph (i.e. the distributional hypothesis).
78. Methodology
The proposed solution implements a four-step strategy:
1. Website crawling;
2. Link vector generation;
3. Content vector generation;
4. Content-link coupled clustering.
79. Methodology
1. Website crawling:
The crawler uses web pages' structural information and exploits web lists in order to:
• mitigate problems coming from noisy links which may not be relevant to the clustering process (advertisement links, shortcuts, etc.);
• include only links belonging to the navigation system.
81. Methodology
2. Link vector generation
– Starting from the homepage, we apply random walk theory to extract a sequence database of random walks.
– Inspired by the field of IR, we consider each random walk a sentence and each web page a word.
– We can then apply distributional algorithms (e.g. Word2Vec) to extract a vector-space representation for each word, e.g. [0.4, 1.2, 3.1, ...] and [1.3, 0.4, 4.9, ...]; see the sketch below.
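A minimal sketch of the link-vector step, assuming the gensim library (an assumption; any Word2Vec implementation would do) and a hypothetical list of random walks over page URLs:

```python
from gensim.models import Word2Vec

# Each walk is a "sentence" and each page a "word", so the embedding of a page
# encodes its reachability context in the website graph.
walks = [
    ['home', 'research', 'papers'],
    ['home', 'staff', 'professor-a'],
    ['home', 'research', 'projects'],
]

model = Word2Vec(sentences=walks, vector_size=64, window=3, min_count=1, sg=1)
link_vector = model.wv['research']  # the 64-dimensional link vector of a page
```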
82. Methodology
3. Content vector generation
– We consider web pages as plain texts (i.e. the bag-of-words hypothesis).
– We apply the traditional TF-IDF weighting scheme to obtain a content-vector representation.
4. Content-link coupled clustering
– Normalize both the content and link vectors by their Euclidean norm.
– Concatenate the content and link vectors of each web page.
– The generated vectors can be used in traditional clustering algorithms based on the vector space model; a sketch follows.
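A toy sketch of the content-link coupling, assuming NumPy and scikit-learn, with random stand-in vectors in place of real TF-IDF content vectors and Word2Vec link vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

def couple(content_vec, link_vec):
    # Normalize each representation by its Euclidean norm before concatenating,
    # so that neither representation dominates the distance computation.
    c = content_vec / (np.linalg.norm(content_vec) or 1.0)
    l = link_vec / (np.linalg.norm(link_vec) or 1.0)
    return np.concatenate([c, l])

# pages: hypothetical (tf-idf content vector, link vector) pairs, one per page.
pages = [(np.random.rand(100), np.random.rand(64)) for _ in range(20)]
X = np.vstack([couple(c, l) for c, l in pages])
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)  # any VSM clusterer works
```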
84. Experiments
• Research questions:
1. What is the real contribution of combining content and hyperlink structure in a single vector-space representation?
2. What is the role of using web lists to reduce noise and improve clustering results?
87. Results
• The best results are obtained by combining textual information with the hyperlink structure.
• Results do not show a statistically significant contribution of web lists for clustering purposes.
88. Publications
• Pasqua Fabiana Lanotte, F. Fumarola, M. Ceci, D. Malerba. Automatic Extraction of Logical Web Lists. ISMIS 2014: 365-374
• G. Pio, Pasqua Fabiana Lanotte, M. Ceci, D. Malerba. Mining Temporal Evolution of Entities in a Stream of Textual Documents. ISMIS 2014: 50-60
• M. Ceci, Pasqua Fabiana Lanotte, F. Fumarola, D. Cavallo, D. Malerba. Completion Time and Next Activity Prediction of Processes Using Sequential Pattern Mining. Discovery Science 2014: 49-61
• Pasqua Fabiana Lanotte, F. Fumarola, D. Malerba, M. Ceci. Automatic Generation of Sitemaps Based on Navigation Systems. MOD 2016: 216-223
• F. Fumarola, Pasqua Fabiana Lanotte, M. Ceci, D. Malerba. CloFAST: closed sequential pattern mining using sparse and vertical id-lists. Knowl. Inf. Syst. 48(2): 429-463 (2016)
• Pasqua Fabiana Lanotte, F. Fumarola, D. Malerba, M. Ceci. Exploiting Web Sites Structural and Content Features for Web Pages Clustering. ISMIS 2017
90. Sequence Extension using VILs
Given two sibling nodes α, β in the CSET:
– the sequence extension (S-Step) of α using β aims at generating a new sequence γ;
– γ is obtained by appending to α the last itemset in β.
Example: ‹e → a› S-Step ‹e → d› = ‹e → a → d›
• The S-Step is achieved by:
1. verifying, for each position j of VILα and VILβ, the condition that the value of VILα[j] is lower than VILβ[j];
2. and/or using a shift-right function to move to the next transaction id of the last itemset in β that has a value greater than the transaction id of the last itemset in α.
A simplified sketch follows.
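A simplified sketch of the S-Step over the list-based VIL/SIL representation used earlier (the real CloFAST procedure interleaves this with closure checking); the VILs below are read off the four-sequence example database:

```python
def sstep(vil_alpha, vil_beta, sil_beta_last):
    # Extend alpha with the last itemset of beta. sil_beta_last is the sparse
    # id-list of beta's last itemset, used to "shift right" to an occurrence
    # strictly after alpha's last matched transaction.
    vil_gamma = []
    for a, b, ids in zip(vil_alpha, vil_beta, sil_beta_last):
        if a is None or b is None:
            vil_gamma.append(None)
        elif b > a:
            vil_gamma.append(b)  # beta's itemset already occurs after alpha's
        else:
            later = [t for t in sorted(ids) if t > a]  # shift-right
            vil_gamma.append(later[0] if later else None)
    return vil_gamma

vil_ea = [None, None, 2, 3]     # VIL of <e -> a>
vil_ed = [None, None, 3, None]  # VIL of <e -> d>
sil_d = [{4}, {1}, {3}, None]   # SIL of item d
print(sstep(vil_ea, vil_ed, sil_d))  # VIL of <e -> a -> d> -> [None, None, 3, None]
```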
91. Example of an S-Step
[Figure: the S-Step on the sequence b → a, computed from VILa and VILb with SILa and the shift-right function.]