Fuzzy Combinations of Criteria: An Application to Web Page Representation for Clustering - Cicling12
1. Fuzzy Combinations of Criteria: An Application to
Web Page Representation for Clustering
Alberto Pérez García-Plaza, Víctor Fresno, Raquel Martínez
NLP & IR Group, Distance Learning University (UNED)
CICLing 2012, New Delhi, India
March 15, 2012
2. Motivation Understanding the system Improving the Combination Summary
Table of Contents
1 Motivation
Web Page Representation
Linear Combination of Criteria
Nonlinear Combination of Criteria
2 Understanding the system
Experimental Settings
Dimension Reduction Analysis
Study of Individual Criteria
3 Improving the Combination
4 Summary
Motivation
Main goal
To understand how to represent web pages for clustering.
Question
How to combine different page features to represent web pages?
Web Page Representation
Hypothesis
A good document representation should be based on how humans
read documents.
Different Criteria for Web Page Representation
Criteria: Title, Emphasis
Word positions: Preferential, Standard
Linear Combination of Criteria
For example: Analytical Combination of Criteria (acc) [1].
Importance of a term in a document:
$I_k = t_k\, i_t + e_k\, i_e + f_k\, i_f + p_k\, i_p$ (1)
$I_k = 1 \cdot 0.4 + 0.6 \cdot 0.3 + 0 \cdot 0.2 + 0 \cdot 0.1 = 0.58$ (2)
Drawback
The importance of a term in a component is calculated regardless
of the rest of the components.
[1] V. Fresno and A. Ribeiro. An analytical approach to concept extraction in HTML environments. J. Intell. Inf. Syst., 22(3):215–235, 2004.
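A minimal Python sketch of equation (1), reading t_k, e_k, f_k, p_k as the title, emphasis, frequency and position values of term k, and i_t, i_e, i_f, i_p as the corresponding criterion weights. This reading, and the names TermCriteria and linear_importance, are assumptions made for illustration; the numbers are the ones substituted into equation (2).

```python
from dataclasses import dataclass


@dataclass
class TermCriteria:
    """One value per criterion (hypothetical container, not from the slides)."""
    title: float
    emphasis: float
    frequency: float
    position: float


# Criterion weights as they appear in equation (2): 0.4, 0.3, 0.2, 0.1.
WEIGHTS = TermCriteria(title=0.4, emphasis=0.3, frequency=0.2, position=0.1)


def linear_importance(term: TermCriteria, weights: TermCriteria = WEIGHTS) -> float:
    """Equation (1): weighted sum of the four criterion values of a term."""
    return (term.title * weights.title
            + term.emphasis * weights.emphasis
            + term.frequency * weights.frequency
            + term.position * weights.position)


# Term values as they appear in equation (2): 1, 0.6, 0, 0.
example = TermCriteria(title=1.0, emphasis=0.6, frequency=0.0, position=0.0)
importance = linear_importance(example)
print(f"{importance:.2f}")
```

Each product depends on a single criterion, which is the drawback stated on this slide: no term of the sum can express a condition that spans two criteria.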
Example: acc
Example of a rhetorical title
"Call to Arms" is the title of a page containing an article
about the new trades made by the New York Yankees baseball team
and how these trades affect the Boston Red Sox, their main rival
in Major League Baseball.
Drawback
Title terms are not related to the document topic.
Nonlinear Combination of Criteria
Fuzzy Combination of Criteria (fcc) [2] allows nonlinear
combinations of criteria.
It is possible to define conditions that relate several criteria.
It produces vectors within the vector space model (VSM).
[2] A. Ribeiro, V. Fresno, M. C. Garcia-Alegre, and D. Guinea. A fuzzy system for the web page representation. 2003.
Example: fcc
Example of a rhetorical title
Now we can express that a term must appear in the title and
be emphasized to be considered important.
Nonlinearity
Title terms can be considered unimportant when they do not
appear in the rest of the text.
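The contrast with a linear sum can be sketched in a few lines of Python. The sketch assumes the fuzzy AND is taken as the minimum of the membership degrees, a common choice that the slides do not confirm for fcc; the function names and the equal weights are hypothetical.

```python
def fuzzy_and(*degrees: float) -> float:
    """Fuzzy conjunction as the minimum of the degrees (assumed operator)."""
    return min(degrees)


def linear_sum(title: float, emphasis: float,
               w_title: float = 0.5, w_emphasis: float = 0.5) -> float:
    """Linear combination of the same two criteria, with hypothetical weights."""
    return title * w_title + emphasis * w_emphasis


# A rhetorical title term: fully in the title, never emphasized in the text.
title_degree, emphasis_degree = 1.0, 0.0

nonlinear_score = fuzzy_and(title_degree, emphasis_degree)
linear_score = linear_sum(title_degree, emphasis_degree)
print(nonlinear_score, linear_score)
```

Under the conjunction the title term receives no support because the second condition fails, whereas the linear sum still rewards it for the title alone.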
A quick glance at fcc
Close to natural language.
Knowledge base: defined by a set of IF-THEN rules.
Rules are based on how humans read documents.
Basic Clustering Settings
Preprocessing: we remove stopwords and punctuation, and strip suffixes with Porter's stemming algorithm.
Clustering: Cluto-rbr with default parameters.
Web page representations: tf-idf and fcc.
Dimension reduction techniques (100, 500, 1000, 2000 and 5000 features): mft (most frequent terms) and lsi (latent semantic indexing).
Datasets: Banksearch and Webkb.
Evaluation: F-measure of clustering quality.
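The slides do not say which F-measure variant is used. As an illustration only, the sketch below implements one widely used definition for clustering: for each reference class, take the best F1 over all clusters, then average these values weighted by class size. The function name and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def clustering_f_measure(labels: Sequence[Hashable],
                         clusters: Sequence[Hashable]) -> float:
    """Class-size-weighted average of the best per-class F1 (assumed variant)."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(labels, clusters))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue  # no shared documents, F1 is zero
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


# Toy example: six documents, two reference classes, two clusters.
gold = ["a", "a", "a", "b", "b", "b"]
predicted = [0, 0, 1, 1, 1, 1]
score = clustering_f_measure(gold, predicted)
print(round(score, 3))
```

A clustering that reproduces the reference classes exactly scores 1.0 under this definition, whatever the cluster identifiers are.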
Dimension Reduction Analysis
Hypothesis
If lsi improves on mft, then the weighting function is not able to find
the most representative terms.
Rep.          Avg.    S.D.
Banksearch
  tf-idf mft  0.748   0.028
  tf-idf lsi  0.756   0.005
  fcc mft     0.756   0.019
  fcc lsi     0.769   0.011
Webkb
  tf-idf mft  0.460   0.051
  tf-idf lsi  0.507   0.006
  fcc mft     0.469   0.009
  fcc lsi     0.466   0.011
Conclusion
The weighting function is not working as well as it could.
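To read the table against the hypothesis, the gain of lsi over mft can be computed per dataset and representation. The short Python sketch below only restates the averages from the table; the names RESULTS and lsi_gain are hypothetical, and the standard deviations are ignored, so small differences should not be over-interpreted.

```python
# Average scores copied from the table: (dataset, representation) -> method -> Avg.
RESULTS = {
    ("Banksearch", "tf-idf"): {"mft": 0.748, "lsi": 0.756},
    ("Banksearch", "fcc"): {"mft": 0.756, "lsi": 0.769},
    ("Webkb", "tf-idf"): {"mft": 0.460, "lsi": 0.507},
    ("Webkb", "fcc"): {"mft": 0.469, "lsi": 0.466},
}


def lsi_gain(scores: dict) -> float:
    """Difference between the lsi and mft averages, rounded to three decimals."""
    return round(scores["lsi"] - scores["mft"], 3)


gains = {key: lsi_gain(scores) for key, scores in RESULTS.items()}
for (dataset, representation), gain in gains.items():
    print(f"{dataset:10s} {representation:7s} {gain:+.3f}")
```

The largest gain appears for tf-idf on Webkb, while fcc on Webkb is the only case where lsi does not improve on mft.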
Conclusion
Results for fcc on the Webkb dataset are surprisingly poor.
Results for Criteria Analysis
Rep. \ Dim.   100     500     1000    2000    5000
Banksearch
  fcc mft     0.723   0.757   0.768   0.765   0.768
  title       0.626   0.646   0.632   0.634   0.639
  emphasis    0.586   0.671   0.674   0.685   0.693
  frequency   0.689   0.715   0.720   0.724   0.731
  position    0.310   0.525   0.538   0.599   0.608
For Banksearch, fcc always scores higher than the individual
criteria, so the combination works better in all cases.
Frequency seems to be the best among the individual criteria.
Results for Criteria Analysis
Rep. \ Dim.   100     500     1000    2000    5000
Webkb
  fcc mft     0.453   0.472   0.475   0.468   0.475
  title       0.432   0.433   0.404   0.488   0.479
  emphasis    0.415   0.431   0.433   0.465   0.489
  frequency   0.441   0.460   0.460   0.468   0.446
  position    0.301   0.283   0.317   0.281   0.286
For Webkb, fcc does not always outperform the others.
Frequency is not always the best among the individual criteria.
When title and emphasis could lead to a better clustering, the
combination gets worse.
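The three observations for Webkb can be checked directly against the table. The Python sketch below copies the Webkb scores and lists where an individual criterion scores strictly above fcc, and which individual criterion is best at each dimension; the function names are hypothetical.

```python
DIMS = (100, 500, 1000, 2000, 5000)

# Webkb scores copied from the table, one tuple per row, in DIMS order.
WEBKB = {
    "fcc": (0.453, 0.472, 0.475, 0.468, 0.475),
    "title": (0.432, 0.433, 0.404, 0.488, 0.479),
    "emphasis": (0.415, 0.431, 0.433, 0.465, 0.489),
    "frequency": (0.441, 0.460, 0.460, 0.468, 0.446),
    "position": (0.301, 0.283, 0.317, 0.281, 0.286),
}


def criteria_beating_fcc(table: dict) -> list:
    """(criterion, dimension) pairs whose score is strictly above fcc."""
    wins = []
    for name, scores in table.items():
        if name == "fcc":
            continue
        for dim, score, combined in zip(DIMS, scores, table["fcc"]):
            if score > combined:
                wins.append((name, dim))
    return wins


def best_criterion_per_dim(table: dict) -> dict:
    """Best-scoring individual criterion (fcc excluded) for each dimension."""
    individual = {k: v for k, v in table.items() if k != "fcc"}
    return {
        dim: max(individual, key=lambda name: individual[name][i])
        for i, dim in enumerate(DIMS)
    }


wins = criteria_beating_fcc(WEBKB)
best = best_criterion_per_dim(WEBKB)
print(wins)
print(best)
```

Title and emphasis overtake the combination only at 2000 and 5000 features, which are also the dimensions where frequency stops being the best individual criterion.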
Improving the Combination
Frequency should influence the decision more than position.
IF Title   AND Frequency   AND Emphasis   AND Position        THEN Importance
   Low         Medium          Low            Preferential    ⇒    Low
   Low         Medium          Low            Standard        ⇒    No
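How two such rules might be evaluated can be sketched as follows. This is an illustration under stated assumptions, not the fcc implementation: the fuzzy AND is taken as the minimum, the membership degrees of the sample term are invented, defuzzification is left out, and all names are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Rule(NamedTuple):
    title: str
    frequency: str
    emphasis: str
    position: str
    importance: str


# The two rules shown on this slide.
RULES: List[Rule] = [
    Rule("Low", "Medium", "Low", "Preferential", "Low"),
    Rule("Low", "Medium", "Low", "Standard", "No"),
]

Memberships = Dict[str, Dict[str, float]]


def firing_strength(rule: Rule, degrees: Memberships) -> float:
    """Degree to which all four antecedents hold (minimum as assumed AND)."""
    return min(
        degrees["title"].get(rule.title, 0.0),
        degrees["frequency"].get(rule.frequency, 0.0),
        degrees["emphasis"].get(rule.emphasis, 0.0),
        degrees["position"].get(rule.position, 0.0),
    )


def fired_outputs(degrees: Memberships) -> Dict[str, float]:
    """Strongest firing strength obtained for each output label."""
    strengths: Dict[str, float] = {}
    for rule in RULES:
        strength = firing_strength(rule, degrees)
        strengths[rule.importance] = max(strength,
                                         strengths.get(rule.importance, 0.0))
    return strengths


# Invented membership degrees for one term (assumption, not from the slides).
term: Memberships = {
    "title": {"Low": 1.0, "High": 0.0},
    "frequency": {"Low": 0.2, "Medium": 0.8, "High": 0.0},
    "emphasis": {"Low": 0.9, "Medium": 0.1, "High": 0.0},
    "position": {"Preferential": 0.3, "Standard": 0.7},
}
outputs = fired_outputs(term)
print(outputs)
```

With these degrees the two rules differ only in the position antecedent, so position alone decides whether the term leans towards Low or No importance.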
Extended Fuzzy Combination of Criteria (efcc)
IF Title AND Frequency AND Emphasis AND Position THEN Importance
High High ⇒ Very High
High Medium Preferential ⇒ High
High Medium Standard ⇒ Medium
High Low Preferential ⇒ Medium
High Low Standard ⇒ Low
Low High Preferential ⇒ High
Low High Standard ⇒ Medium
Low Medium Preferential ⇒ Medium
Low Medium Standard ⇒ Low
Low Low Preferential ⇒ Low
Low Low Standard ⇒ No
High ⇒ Very High
Medium ⇒ Medium
Low ⇒ No
System Comparison
With efcc, both reduction methods get similar results.
Rep.          Avg.    S.D.
Banksearch
  tf-idf lsi  0.756   0.005
  fcc lsi     0.769   0.011
  efcc mft    0.760   0.014
  efcc lsi    0.758   0.013
Webkb
  tf-idf lsi  0.507   0.006
  fcc mft     0.469   0.009
  efcc mft    0.532   0.032
  efcc lsi    0.483   0.000
efcc solves the problems of fcc in Webkb.
Summary
We present a term weighting function based on how humans
read documents.
The representation is not tailored to specific sets of web
pages.
Nonlinear systems help express relations among criteria.
With a good term weighting function it is possible to use
lightweight dimension reduction techniques.
Our system tries to ease communication between technical
and linguistic experts.
Anchor texts were also studied as a way of adding contextual
information.
Thank You!