1. Is This a Good Title?
Martin Klein and Jeffery Shipman and Michael L. Nelson
Old Dominion University
{mklein,jshipman,mln}@cs.odu.edu
Hypertext 2010
Toronto, Canada
06/14/2010
This work is supported in part by the Library of Congress
5. The Problem
Internet Archive - www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com
Wayback Machine: 59 copies
Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
6. The Problem
www.aircharter-international.com
Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
7. The Problem
www.aircharter-international.com
Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
8. The Problem
http://www.drbartell.com/
Lexical Signature (TF/IDF): Plastic Surgeon Reconstructive Dr Bartell Symbol University
???
9. The Problem
http://www.drbartell.com/
Title: Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery
10. The Problem
www.reagan.navy.mil
Lexical Signature (TF/IDF): Ronald USS MCSN Torrey Naval Sea Commanding
12. The Problem
www.reagan.navy.mil
Title: Home Page
???
Is This a Good Title?
13. Contributions
• Discuss the discovery performance of web page titles (compared to LSs)
• Analyze discovered pages regarding their relevancy
• Display title evolution compared to content evolution over time
• Provide a prediction model for a title’s retrieval potential
14. Experiment - Data Gathering
• 20k URIs randomly sampled from DMOZ
• Applied filters
  • English language
  • min. of 50 terms [Park]
• Results in 6,875 URIs
• Downloaded and parsed the pages
• Extract title and generate LS per page (baseline)

          .com   .org  .net  .edu    sum
Original  15289  2755  1459   497  20000
Filtered   4863  1327   369   316   6875

[Park]
S.-T. Park et al. “Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web” ACM TOIS 22(4):540-572, 2004
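The LS generation step can be illustrated with a minimal sketch. It assumes a small local corpus (`corpus_texts`, a hypothetical stand-in) for estimating document frequencies, a simple alphabetic tokenizer, and add-one smoothing of the IDF; none of these details are stated on the slides, so this is an illustration of TF/IDF ranking rather than the procedure used in the experiment.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus_texts, k=5):
    """Return the k terms of page_text with the highest TF-IDF score.

    corpus_texts is a hypothetical local collection used only to
    estimate document frequencies (an assumption of this sketch).
    """
    tf = Counter(tokenize(page_text))
    doc_sets = [set(tokenize(t)) for t in corpus_texts]
    n_docs = len(doc_sets)
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for s in doc_sets if term in s)
        # Add-one smoothing keeps the IDF finite for terms unseen in the corpus.
        idf = math.log((n_docs + 1) / (df + 1))
        scores[term] = freq * idf
    # Sort by descending score, then alphabetically so ties are stable.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

Calling `lexical_signature(text, corpus, k=7)` would give the 7-term variant; frequent but common terms such as "the" sink to the bottom because their IDF is low.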
15. Title (and LS) Retrieval Performance
[Figure: relative number of URLs returned Top Ranked, Top 10, Top 100, and Undiscovered; left panel: Titles, right panel: 5- and 7-Term LSs]
• Titles return more than 60% of URIs top ranked
• Binary retrieval pattern: a URI is either within the top 10 or undiscovered
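The four outcome classes used above can be sketched as a small helper. It assumes that URIs are compared by exact string match and that anything ranked beyond position 100 counts as undiscovered; both are assumptions of this sketch, and `rank_bucket` is a hypothetical name.

```python
def rank_bucket(result_uris, target_uri):
    """Classify where target_uri appears in a ranked list of result URIs."""
    try:
        rank = result_uris.index(target_uri) + 1  # 1-based rank
    except ValueError:
        return "undiscovered"
    if rank == 1:
        return "top"
    if rank <= 10:
        return "top10"
    if rank <= 100:
        return "top100"
    # Assumption: results beyond rank 100 are treated as undiscovered.
    return "undiscovered"
```

Counting the buckets over all sampled URIs yields the kind of distribution shown in the figure.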
16. Relevancy of Retrieval Results
Do titles return relevant results besides the original URI?
• Distinguish between discovered (top 10) and undiscovered URIs
• Analyze content of top 10 results
• Measure relevancy in terms of normalized term overlap and shingles between original URI and search result by rank
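Both relevancy measures can be sketched as set comparisons. The slides do not say how the term overlap is normalized or how long a shingle is, so this sketch assumes Jaccard normalization (shared items divided by all items) and word-level shingles of width `w=3`; the function names are hypothetical.

```python
import re


def terms(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 1.0  # two empty inputs are treated as identical
    return len(a & b) / len(a | b)


def term_overlap(text_a, text_b):
    """Normalized overlap of the unique terms of two texts."""
    return jaccard(set(terms(text_a)), set(terms(text_b)))


def shingles(text, w=3):
    """Set of all contiguous w-word sequences in a text."""
    toks = terms(text)
    if len(toks) < w:
        # A text shorter than w words forms a single shingle.
        return {tuple(toks)} if toks else set()
    return {tuple(toks[i:i + w]) for i in range(len(toks) - w + 1)}


def shingle_similarity(text_a, text_b, w=3):
    """Normalized overlap of the w-word shingles of two texts."""
    return jaccard(shingles(text_a, w), shingles(text_b, w))
```

Term overlap ignores word order, while shingles reward shared word sequences, so a value of 1 for shingles points to a near-identical document such as an alias or duplicate.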
17. Relevancy of Retrieval Results
Term Overlap
[Figure: frequency of term overlap values (1, > 0.75, > 0.5, > 0.0, 0) by rank 1-10; panels: Discovered, Undiscovered]
High relevancy in the top ranks with possible aliases and duplicates.
18. Relevancy of Retrieval Results
Shingles
[Figure: frequency of shingle values (1, > 0.75, > 0.5, > 0.0, 0) by rank 1-10; panels: Discovered, Undiscovered]
More results with optimal shingle values than top ranked URIs, indicating possible aliases and duplicates.
19. Title Evolution - Example I
www.sun.com/solutions
1998-01-27 Sun Software Products Selector Guides - Solutions Tree
1999-02-20 Sun Software Solutions
2002-02-01 Sun Microsystems Products
2002-06-01 Sun Microsystems - Business & Industry Solutions
2003-08-01 Sun Microsystems - Industry & Infrastructure Solutions
2004-02-02 Sun Microsystems - Solutions
2004-06-10 Gateway Page - Sun Solutions
2006-01-09 Sun Microsystems Solutions & Services
2007-01-03 Services & Solutions
2007-02-07 Sun Services & Solutions
2008-01-19 Sun Solutions Sun Solutions
21. Title Evolution - Example II
www.datacity.com/mainf.html
2000-06-19 DataCity of Manassas Park Main Page
2000-10-12 DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives
2001-08-21 DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives
2002-10-16 computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free
2006-03-14 Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB
23. Title Evolution Over Time
How much do titles change over time?
• Copies from fixed size time windows per year
• Extract available titles of past 14 years
• Compute normalized Levenshtein edit distance between titles of copies and baseline (0 = identical; 1 = completely dissimilar)
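The distance computation can be sketched as follows. The slides do not state the normalization, so this sketch assumes character-level edits divided by the length of the longer title, which maps identical titles to 0 and titles with no character in common position to 1; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length
    (an assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty titles are identical
    return levenshtein(a, b) / longest
```

For example, `normalized_edit_distance(archived_title, baseline_title)` would be computed once per archived copy, with the current title serving as the baseline.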
24. Title Evolution Over Time
[Figure: title edit distance frequencies by year, 2/1997 to 2/2009; legend bins from 0 (unchanged) and 0.1 (slightly changed) up to 1.0]
• Half the titles of available copies from recent years are (close to) identical
• Decay from 2005 on (with fewer copies available)
• 4 year old title: 40% chance to be unchanged
27. Title Evolution Over Time
Title vs Document
• Y: avg shingle value for all copies per URI
• X: avg edit distance of corresponding titles
• Overlap indicated by: green: <10, red: >90
• Semi-transparent: total amount of points plotted
• Plot annotations: [0,1] - over 1600 times; [0,0] - 122 times
29. Title Performance Prediction
• Quality prediction of title by
  • Number of nouns, articles etc.
  • Amount of title terms, characters ([Ntoulas])
• Observation of re-occurring terms in poorly performing titles - “Stop Titles”
home, index, home page, welcome, untitled document
The performance of any given title can be predicted as insufficient if 75% or more of it consists of a “Stop Title”!
[Ntoulas]
A. Ntoulas et al. “Detecting Spam Web Pages Through Content Analysis” In Proceedings of WWW 2006, pp 83-92
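The 75% rule can be sketched as a simple ratio check. The slides do not say whether the share is measured in terms or in characters, so this sketch assumes it is the fraction of title terms that occur in one of the listed Stop Titles, and it treats an empty title as insufficient; the names `STOP_TITLES`, `stop_title_ratio`, and `predicted_insufficient` are hypothetical.

```python
# Stop Titles as listed on the slide.
STOP_TITLES = ["home", "index", "home page", "welcome", "untitled document"]


def stop_title_ratio(title, stop_titles=STOP_TITLES):
    """Fraction of title terms that occur in any Stop Title
    (term-based measurement is an assumption of this sketch)."""
    stop_words = {w for phrase in stop_titles for w in phrase.split()}
    words = [w.strip(".,:;!?-|") for w in title.lower().split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    if not words:
        return 1.0  # assumption: an empty title carries no retrieval value
    hits = sum(1 for w in words if w in stop_words)
    return hits / len(words)


def predicted_insufficient(title, threshold=0.75):
    """True if the title consists of Stop Title terms to 75% or more."""
    return stop_title_ratio(title) >= threshold
```

Under these assumptions, "Home Page" is flagged, while a descriptive title such as the drbartell.com example is not.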
30. Concluding Remarks
The “aboutness” of web pages can be determined from either the content or the title.
More than 60% of URIs are returned top ranked when using the title as a search engine query.
Titles change more slowly and less significantly over time than the web pages’ content.
Not all titles are equally good.
If the majority of a title’s terms are Stop Titles, its quality can be predicted to be poor.
31. Is This a Good Title?
Questions?
Martin Klein and Jeffery Shipman and Michael L. Nelson
Old Dominion University
{mklein,jshipman,mln}@cs.odu.edu