Is This a Good Title?


Martin Klein and Jeffery Shipman and Michael L. Nelson
                Old Dominion University

         {mklein,jshipman,mln}@cs.odu.edu

                      Hypertext 2010
                      Toronto, Canada
                        06/14/2010

                   This work is supported in part by the Library of Congress
The Problem

Professional Scholarly Publishing 2003
http://www.pspcentral.org/events/annual_meeting_2003.html




The Problem
Internet Archive - Wayback Machine
www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com




59 copies

Lexical Signature
(TF/IDF)
Charter Aircraft Cargo
Passenger Jet Air
Enquiry

Title
ACMI, Private Jet
Charter, Private Jet
Lease, Charter Flight
Service: Air Charter
International
The Problem
                     www.aircharter-international.com



Lexical Signature
(TF/IDF)
Charter Aircraft Cargo
Passenger Jet Air
Enquiry
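
As a rough illustration of how a lexical signature like the one above could be derived, the following minimal Python sketch ranks a page's terms by TF-IDF and keeps the top k. The tokenizer, the tiny stop word list, the smoothed IDF formula, and the `doc_freq` and `num_docs` inputs are all assumptions made for illustration; the slides do not say how document frequencies were obtained.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "or", "the", "to"}


def tokenize(text):
    """Lowercase the text and return its alphabetic terms minus stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def lexical_signature(text, doc_freq, num_docs, k=5):
    """Return up to k terms of `text` with the highest TF-IDF score.

    doc_freq maps a term to the number of corpus documents containing it
    (an assumed input); unseen terms count as document frequency 0.
    """
    tf = Counter(tokenize(text))
    scores = {
        term: count * math.log((1 + num_docs) / (1 + doc_freq.get(term, 0)))
        for term, count in tf.items()
    }
    # Sort by descending score, then alphabetically for a stable result.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]
```

Terms that are frequent on the page but rare in the corpus rise to the top, which is why a signature tends to contain distinctive words rather than generic ones.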




The Problem
                        www.aircharter-international.com


Title
ACMI, Private Jet
Charter, Private Jet
Lease, Charter Flight
Service: Air Charter
International




The Problem
                    http://www.drbartell.com/


Lexical Signature
(TF/IDF)

                                                ???
Plastic Surgeon
Reconstructive Dr
Bartell Symbol
University




The Problem
                    http://www.drbartell.com/


Title
Thomas Bartell MD
Board-Certified -
Cosmetic Plastic
Reconstructive
Surgery




The Problem
                     www.reagan.navy.mil



Lexical Signature
(TF/IDF)
Ronald USS MCSN
Torrey Naval Sea
Commanding




The Problem
             www.reagan.navy.mil




                                   ???
Title
Home Page





Is This a Good Title?

Contributions


•   Discuss discovery performance of web page titles
    (compared to lexical signatures, LSs)

•   Analyze discovered pages regarding their
    relevancy

•   Display title evolution compared to content
    evolution over time

•   Provide a prediction model for a title’s retrieval potential

Experiment - Data Gathering

      •     20k URIs randomly sampled from DMOZ

      •     Applied filters
           •     English language
           •     min. of 50 terms [Park]

      •     Results in 6,875 URIs

      •     Downloaded and parsed the pages

      •     Extract title and generate LS per page (baseline)

                    .com     .org     .net     .edu      sum
      Original     15289     2755     1459      497    20000
      Filtered      4863     1327      369      316     6875

[Park]
S.T. Park et al. “Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web” ACM ToIS 22(4):540-572, 2004
Title (and LS) Retrieval Performance

[Figure: two bar charts, "Titles" and "5- and 7-Term LSs". X axis: Top, Top10, Top100, Undiscovered. Y axis: Relative Number of URLs (0 to 70 for titles, 0 to 60 for LSs). Legend: Top Ranked, Top 10, Top 100, Undiscovered.]

•   Titles return more than 60% of URIs top ranked

•   Binary retrieval pattern: a URI is either within the top 10 or
    undiscovered
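
The four categories in the charts can be sketched as a simple bucketing of the rank at which the original URI appears in the result list. This is only an illustration: the exact cutoffs (rank 1, ranks 2 to 10, ranks 11 to 100, everything else) and the function names are assumptions, since the slides only name the categories.

```python
from collections import Counter

CATEGORIES = ("Top Ranked", "Top 10", "Top 100", "Undiscovered")


def classify_rank(rank):
    """Map a 1-based result rank (None if the URI was not returned) to a category."""
    if rank is None or rank > 100:
        return "Undiscovered"
    if rank == 1:
        return "Top Ranked"
    if rank <= 10:
        return "Top 10"
    return "Top 100"


def relative_frequencies(ranks):
    """Return the percentage of URIs that fall into each category."""
    if not ranks:
        return {category: 0.0 for category in CATEGORIES}
    counts = Counter(classify_rank(rank) for rank in ranks)
    return {category: 100.0 * counts[category] / len(ranks) for category in CATEGORIES}
```

For example, `relative_frequencies([1, 1, 5, 50, None])` assigns 40% of the URIs to "Top Ranked" and 20% to each of the other three categories.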
Relevancy of Retrieval Results

Do titles return relevant
results besides the
original URI?

•   Distinguish between
    discovered (top 10) and
    undiscovered URIs

•   Analyze content of top 10
    results

•   Measure relevancy in terms of
    normalized term overlap
    and shingles between original
    URI and search result by rank
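
The two relevancy measures named above can be sketched as follows. This is a minimal illustration under stated assumptions: term overlap is computed here as the Jaccard coefficient of the two term sets, shingles are word-level w-grams compared with the same coefficient, and the window size `w=3` and whitespace tokenization are arbitrary choices not taken from the slides.

```python
def terms(text):
    """Split text into lowercase whitespace-separated terms."""
    return text.lower().split()


def jaccard(set_a, set_b):
    """Jaccard coefficient of two sets; two empty sets count as identical."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def term_overlap(doc_a, doc_b):
    """Normalized term overlap: 1 = same vocabulary, 0 = no shared terms."""
    return jaccard(set(terms(doc_a)), set(terms(doc_b)))


def shingles(text, w=3):
    """Return the set of contiguous w-term sequences (shingles) in text."""
    tokens = terms(text)
    if len(tokens) < w:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def shingle_similarity(doc_a, doc_b, w=3):
    """Fraction of shared shingles: sensitive to term order, unlike term overlap."""
    return jaccard(shingles(doc_a, w), shingles(doc_b, w))
```

Two documents with the same words in a different order score 1.0 on term overlap but much lower on shingles, which is why the two measures are reported separately.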
Relevancy of Retrieval Results
[Figure: Term Overlap. Two panels, "Discovered" and "Undiscovered". X axis: Rank (1 to 10). Y axis: Frequency (0 to 6000 for discovered, 0 to 1500 for undiscovered). Legend (term overlap value): 1, > 0.75, > 0.5, > 0.0, 0.]

High relevancy in the top ranks,
with possible aliases and duplicates.
Relevancy of Retrieval Results

[Figure: Shingles. Two panels, "Discovered" and "Undiscovered". X axis: Rank (1 to 10). Y axis: Frequency (0 to 5000 for discovered, 0 to 1500 for undiscovered). Legend (shingle value): 1, > 0.75, > 0.5, > 0.0, 0.]

More results with optimal shingle values than top ranked URIs,
indicating possible aliases and duplicates.
Title Evolution - Example I
www.sun.com/solutions

1998-01-27   Sun Software Products Selector Guides - Solutions Tree
1999-02-20   Sun Software Solutions
2002-02-01   Sun Microsystems Products
2002-06-01   Sun Microsystems - Business & Industry Solutions
2003-08-01   Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions
2004-02-02   Sun Microsystems - Solutions
2004-06-10   Gateway Page - Sun Solutions
2006-01-09   Sun Microsystems Solutions & Services
2007-01-03   Services & Solutions
2007-02-07   Sun Services & Solutions
2008-01-19   Sun Solutions
Title Evolution - Example II
www.datacity.com/mainf.html

2000-06-19   DataCity of Manassas Park Main Page

2000-10-12   DataCity of Manassas Park sells Custom Built Computers &
             Removable Hard Drives

2001-08-21   DataCity a computer company in Manassas Park sells Custom
             Built Computers & Removable Hard Drives

2002-10-16   computer company in Manassas Virginia sells Custom Built
             Computers with Removable Hard Drives Kits and Iomega 2GB
             Jaz Drives (jazz drives) October 2002 DataCity
             800-326-5051 toll free

2006-03-14   Est 1989 Computer company in Stafford Virginia sells Custom
             Built Secure Computers with DoD 5200.1-R Approved Removable
             Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives
             (jazz drives), introduces the IllumiNite; lighted keyboard
             DataCity 800-326-5051 Service Disabled Veteran Owned
             Business SDVOB
Title Evolution Over Time

How much do titles
change over time?

•   Copies from fixed size time
    windows per year

•   Extract available titles of past
    14 years

•   Compute normalized
    Levenshtein edit
    distance between titles of
    copies and baseline
    (0 = identical; 1 = completely
    dissimilar)
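
A normalized Levenshtein distance with the range described above (0 = identical, 1 = completely dissimilar) can be sketched as below. Dividing by the length of the longer title is an assumption; the slides do not state which normalization was used.

```python
def levenshtein(a, b):
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Levenshtein distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty titles are identical
    return levenshtein(a, b) / longest
```

Applied to two titles from the www.sun.com/solutions example, "Sun Services & Solutions" and "Sun Solutions" differ by 11 deletions, giving a normalized distance of 11/24.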
Title Evolution Over Time




[Figure: Title edit distance frequencies per year, from 2/2009 back to 2/1997 in two-year steps. Y axis: 0 to 100. Legend: normalized edit distance from 0 (Unchanged) and 0.1 (Slightly Changed) up to 1.0, in steps of 0.1.]

•   Half the titles of available copies from
    recent years are (close to) identical

•   Decay from 2005 on (with fewer copies
    available)

•   4 year old title: 40% chance to be
    unchanged
Title Evolution Over Time
    Title vs Document

•   Y: avg shingle value for
    all copies per URI

•   X: avg edit distance of
    corresponding titles

•   overlap indicated by:
    green: <10
    red: >90

•   Semi-transparent: total
    amount of points
    plotted


•   Plot annotations:
    [0,0] - 122 times
    [0,1] - over 1600 times

Title Performance Prediction

    •    Quality prediction of title by

        •    Number of nouns, articles etc.

        •    Number of title terms, characters ([Ntoulas])

    •    Observation of re-occurring terms in poorly performing
         titles - “Stop Titles”

    home, index, home page, welcome, untitled document




                  The performance of any given title can
                  be predicted as insufficient if 75% or more
                     of it consists of a “Stop Title”!

[Ntoulas]
A. Ntoulas et al. “Detecting Spam Web Pages Through Content Analysis” In Proceedings of WWW 2006, pp 83-92
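
The Stop Title rule can be sketched as a simple ratio check. This is a hypothetical illustration, not the authors' implementation: the ratio is measured here over title terms (the slides do not say whether terms or characters are counted), the stop-title list contains only the five examples shown on the slide, and treating an empty title as fully uninformative is an added assumption.

```python
import re

# Only the five examples listed on the slide; the full list is not given.
STOP_TITLES = ["home", "index", "home page", "welcome", "untitled document"]


def stop_title_ratio(title, stop_titles=STOP_TITLES):
    """Fraction of title terms covered by stop-title phrases (0.0 to 1.0)."""
    terms = re.findall(r"\w+", title.lower())
    if not terms:
        return 1.0  # assumption: an empty title has no retrieval value
    # Try longer phrases first so that "home page" wins over "home".
    phrases = sorted((p.split() for p in stop_titles), key=len, reverse=True)
    covered = 0
    i = 0
    while i < len(terms):
        for phrase in phrases:
            if terms[i:i + len(phrase)] == phrase:
                covered += len(phrase)
                i += len(phrase)
                break
        else:
            i += 1  # no stop title starts at this term
    return covered / len(terms)


def predicted_insufficient(title, threshold=0.75):
    """Apply the 75% rule: True if the title is mostly a stop title."""
    return stop_title_ratio(title) >= threshold
```

With this sketch, the title "Home Page" from the www.reagan.navy.mil example is flagged as insufficient, while the www.drbartell.com title is not.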
Concluding Remarks

The “aboutness” of web pages can be determined from either
the content or from the title.

More than 60% of URIs are returned top ranked when using
the title as a search engine query.

Titles change more slowly and less significantly over time than
the web pages’ content.

Not all titles are equally good.
If the majority of a title’s terms are Stop Titles, its quality can be
predicted to be poor.

Is This a Good Title?



                    Questions?



Martin Klein and Jeffery Shipman and Michael L. Nelson
                Old Dominion University

         {mklein,jshipman,mln}@cs.odu.edu
                                                         23

Contenu connexe

Similaire à Is This a Good Title?

Digital Library Federation, Fall 07, Connotea Presentation
Digital Library Federation, Fall 07, Connotea PresentationDigital Library Federation, Fall 07, Connotea Presentation
Digital Library Federation, Fall 07, Connotea PresentationIan Mulvany
 
Finding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic WebFinding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic Webebiquity
 
New trends in corporate information centres
New trends in corporate information centresNew trends in corporate information centres
New trends in corporate information centresEduserv
 
(Re-) Discovering Lost Web Pages
(Re-) Discovering Lost Web Pages(Re-) Discovering Lost Web Pages
(Re-) Discovering Lost Web PagesMichael Nelson
 
Tech, Reference, AND PATRON Views of our new Front-End
Tech, Reference, AND PATRON Views of our new Front-EndTech, Reference, AND PATRON Views of our new Front-End
Tech, Reference, AND PATRON Views of our new Front-Endkramsey
 
高性能网站建设指南
高性能网站建设指南高性能网站建设指南
高性能网站建设指南Bob Huang
 
History and Background of the USEWOD Data Challenge
History and Background of the  USEWOD Data ChallengeHistory and Background of the  USEWOD Data Challenge
History and Background of the USEWOD Data ChallengeKnud Möller
 
2011 Search Query Rewrites - Synonyms & Acronyms
2011 Search Query Rewrites - Synonyms & Acronyms2011 Search Query Rewrites - Synonyms & Acronyms
2011 Search Query Rewrites - Synonyms & AcronymsBrian Johnson
 
Inheritance - the myth of code reuse | Andrei Raifura | CodeWay 2015
Inheritance - the myth of code reuse | Andrei Raifura | CodeWay 2015Inheritance - the myth of code reuse | Andrei Raifura | CodeWay 2015
Inheritance - the myth of code reuse | Andrei Raifura | CodeWay 2015YOPESO
 

Similaire à Is This a Good Title? (10)

Digital Library Federation, Fall 07, Connotea Presentation
Digital Library Federation, Fall 07, Connotea PresentationDigital Library Federation, Fall 07, Connotea Presentation
Digital Library Federation, Fall 07, Connotea Presentation
 
Finding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic WebFinding knowledge, data and answers on the Semantic Web
Finding knowledge, data and answers on the Semantic Web
 
New trends in corporate information centres
New trends in corporate information centresNew trends in corporate information centres
New trends in corporate information centres
 
(Re-) Discovering Lost Web Pages
(Re-) Discovering Lost Web Pages(Re-) Discovering Lost Web Pages
(Re-) Discovering Lost Web Pages
 
Tech, Reference, AND PATRON Views of our new Front-End
Tech, Reference, AND PATRON Views of our new Front-EndTech, Reference, AND PATRON Views of our new Front-End
Tech, Reference, AND PATRON Views of our new Front-End
 
高性能网站建设指南
高性能网站建设指南高性能网站建设指南
高性能网站建设指南
 
History and Background of the USEWOD Data Challenge
History and Background of the  USEWOD Data ChallengeHistory and Background of the  USEWOD Data Challenge
History and Background of the USEWOD Data Challenge
 
2011 Search Query Rewrites - Synonyms & Acronyms
2011 Search Query Rewrites - Synonyms & Acronyms2011 Search Query Rewrites - Synonyms & Acronyms
2011 Search Query Rewrites - Synonyms & Acronyms
 
Inheritance - the myth of code reuse | Andrei Raifura | CodeWay 2015
Inheritance - the myth of code reuse | Andrei Raifura | CodeWay 2015Inheritance - the myth of code reuse | Andrei Raifura | CodeWay 2015
Inheritance - the myth of code reuse | Andrei Raifura | CodeWay 2015
 
Online Public Compound Databases
Online Public Compound DatabasesOnline Public Compound Databases
Online Public Compound Databases
 

Plus de Martin Klein

On the Persistence of Persistent Identifiers of the Scholarly Web
On the Persistence of Persistent Identifiers of the Scholarly WebOn the Persistence of Persistent Identifiers of the Scholarly Web
On the Persistence of Persistent Identifiers of the Scholarly WebMartin Klein
 
On the Persistence of Persistent Identifiers of the Scholarly Web
 On the Persistence of Persistent Identifiers of the Scholarly Web On the Persistence of Persistent Identifiers of the Scholarly Web
On the Persistence of Persistent Identifiers of the Scholarly WebMartin Klein
 
An Institutional Perspective to Rescue Scholarly Orphans
An Institutional Perspective to Rescue Scholarly OrphansAn Institutional Perspective to Rescue Scholarly Orphans
An Institutional Perspective to Rescue Scholarly OrphansMartin Klein
 
Who is Asking - Humans and Machines Experience a Different Scholarly Web
Who is Asking - Humans and Machines  Experience a Different Scholarly WebWho is Asking - Humans and Machines  Experience a Different Scholarly Web
Who is Asking - Humans and Machines Experience a Different Scholarly WebMartin Klein
 
The Memento Tracer Framework: Balancing Quality and Scalability for Web Arch...
The Memento Tracer Framework: Balancing Quality and Scalability  for Web Arch...The Memento Tracer Framework: Balancing Quality and Scalability  for Web Arch...
The Memento Tracer Framework: Balancing Quality and Scalability for Web Arch...Martin Klein
 
Memento Tracer An Innovative Approach Towards Balancing Scale and Fidelity f...
Memento Tracer An Innovative Approach Towards Balancing  Scale and Fidelity f...Memento Tracer An Innovative Approach Towards Balancing  Scale and Fidelity f...
Memento Tracer An Innovative Approach Towards Balancing Scale and Fidelity f...Martin Klein
 
Comparing the Performance of OAI-PMH with ResourceSync
Comparing the Performance of OAI-PMH with ResourceSyncComparing the Performance of OAI-PMH with ResourceSync
Comparing the Performance of OAI-PMH with ResourceSyncMartin Klein
 
Evaluating Memento Service Optimizations
Evaluating Memento Service OptimizationsEvaluating Memento Service Optimizations
Evaluating Memento Service OptimizationsMartin Klein
 
An Institutional Perspective to Rescue Scholarly Orphans
An Institutional Perspective to Rescue Scholarly OrphansAn Institutional Perspective to Rescue Scholarly Orphans
An Institutional Perspective to Rescue Scholarly OrphansMartin Klein
 
A Vision of the Library’s Role in Archiving Scholarly Artifacts
A Vision of the Library’s Role  in Archiving Scholarly ArtifactsA Vision of the Library’s Role  in Archiving Scholarly Artifacts
A Vision of the Library’s Role in Archiving Scholarly ArtifactsMartin Klein
 
First Steps in Research Data Management Under Constraints of a National Secur...
First Steps in Research Data Management Under Constraints of a National Secur...First Steps in Research Data Management Under Constraints of a National Secur...
First Steps in Research Data Management Under Constraints of a National Secur...Martin Klein
 
Smart Routing of Memento Requests
Smart Routing of Memento RequestsSmart Routing of Memento Requests
Smart Routing of Memento RequestsMartin Klein
 
Building Event Collections from Crawling Web Archives
Building Event Collections from Crawling Web ArchivesBuilding Event Collections from Crawling Web Archives
Building Event Collections from Crawling Web ArchivesMartin Klein
 
A Web-Centric Pipeline for Archiving Scholarly Artifacts
A Web-Centric Pipeline for Archiving Scholarly ArtifactsA Web-Centric Pipeline for Archiving Scholarly Artifacts
A Web-Centric Pipeline for Archiving Scholarly ArtifactsMartin Klein
 
Focused Crawl of Web Archives to Build Event Collections
Focused Crawl of Web Archives to Build Event CollectionsFocused Crawl of Web Archives to Build Event Collections
Focused Crawl of Web Archives to Build Event CollectionsMartin Klein
 
Creating Topical Collections: Web Archives vs. Live Web
Creating Topical Collections:Web Archives vs. Live WebCreating Topical Collections:Web Archives vs. Live Web
Creating Topical Collections: Web Archives vs. Live WebMartin Klein
 
Robust Linking to Web Resources
Robust Linking to Web ResourcesRobust Linking to Web Resources
Robust Linking to Web ResourcesMartin Klein
 
Signposting for Repositories
Signposting for RepositoriesSignposting for Repositories
Signposting for RepositoriesMartin Klein
 
Discovering Scholarly Orphans Using ORCID
Discovering Scholarly Orphans Using ORCIDDiscovering Scholarly Orphans Using ORCID
Discovering Scholarly Orphans Using ORCIDMartin Klein
 
Using the Memento Framework to Assess Content Drift in Scholarly Communication
Using the Memento Framework to Assess Content Drift in Scholarly CommunicationUsing the Memento Framework to Assess Content Drift in Scholarly Communication
Using the Memento Framework to Assess Content Drift in Scholarly CommunicationMartin Klein
 

Plus de Martin Klein (20)

On the Persistence of Persistent Identifiers of the Scholarly Web
On the Persistence of Persistent Identifiers of the Scholarly WebOn the Persistence of Persistent Identifiers of the Scholarly Web
On the Persistence of Persistent Identifiers of the Scholarly Web
 
On the Persistence of Persistent Identifiers of the Scholarly Web
 On the Persistence of Persistent Identifiers of the Scholarly Web On the Persistence of Persistent Identifiers of the Scholarly Web
On the Persistence of Persistent Identifiers of the Scholarly Web
 
An Institutional Perspective to Rescue Scholarly Orphans
An Institutional Perspective to Rescue Scholarly OrphansAn Institutional Perspective to Rescue Scholarly Orphans
An Institutional Perspective to Rescue Scholarly Orphans
 
Who is Asking - Humans and Machines Experience a Different Scholarly Web
Who is Asking - Humans and Machines  Experience a Different Scholarly WebWho is Asking - Humans and Machines  Experience a Different Scholarly Web
Who is Asking - Humans and Machines Experience a Different Scholarly Web
 
The Memento Tracer Framework: Balancing Quality and Scalability for Web Arch...

Is This a Good Title?

  • 1. Is This a Good Title? Martin Klein and Jeffery Shipman and Michael L. Nelson Old Dominion University {mklein,jshipman,mln}@cs.odu.edu Hypertext 2010 Toronto, Canada 06/14/2010 This work is supported in part by the Library of Congress
  • 2. The Problem Professional Scholarly Publishing 2003 http://www.pspcentral.org/events/annual_meeting_2003.html 2
  • 3. The Problem Internet Archive - www.aircharter-international.com http://web.archive.org/web/*/http://www.aircharter-international.com Wayback Machine 3
  • 4. The Problem Internet Archive - www.aircharter-international.com http://web.archive.org/web/*/http://www.aircharter-international.com Wayback Machine 59 copies 3
  • 5. The Problem Internet Archive - Wayback Machine • www.aircharter-international.com • http://web.archive.org/web/*/http://www.aircharter-international.com • 59 copies • Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry • Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
  • 6. The Problem www.aircharter-international.com Lexical Signature (TF/IDF) Charter Aircraft Cargo Passenger Jet Air Enquiry 4
  • 7. The Problem www.aircharter-international.com Title ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International 5
  • 8. The Problem http://www.drbartell.com/ Lexical Signature (TF/IDF) ??? Plastic Surgeon Reconstructive Dr Bartell Symbol University 6
  • 9. The Problem http://www.drbartell.com/ Title Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery 7
  • 10. The Problem www.reagan.navy.mil Lexical Signature (TF/IDF) Ronald USS MCSN Torrey Naval Sea Commanding 8
  • 11. The Problem www.reagan.navy.mil ??? Title Home Page 9
  • 12. The Problem www.reagan.navy.mil ??? Title Home Page Is This a Good Title? 9
  • 13. Contributions • Discuss discovery performance of web page titles (compared to LSs) • Analyze the relevancy of discovered pages • Display title evolution compared to content evolution over time • Provide a prediction model for a title’s retrieval potential
  • 14. Experiment - Data Gathering • 20k URIs randomly sampled from DMOZ • Applied filters: English language, min. of 50 terms [Park] • Results in 6,875 URIs • Downloaded and parsed the pages • Extracted title and generated LS per page (baseline)
       URIs per top-level domain:
                    .com    .org    .net    .edu     sum
       Original    15289    2755    1459     497   20000
       Filtered     4863    1327     369     316    6875
       [Park] S.T. Park et al. “Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web” ACM ToIS 22(4):540-572, 2004
  • 15. Title (and LS) Retrieval Performance [Figure: two bar charts, “Titles” and “5- and 7-Term LSs”; x-axis: Top, Top10, Top100, Undiscovered; y-axis: Relative Number of URLs] • Titles return more than 60% URIs top ranked • Binary retrieval pattern, URI either within top 10 or undiscovered
  • 16. Relevancy of Retrieval Results Do titles return relevant results besides the original URI? • Distinguish between discovered (top 10) and undiscovered URIs • Analyze content of top 10 results • Measure relevancy in terms of normalized term overlap and shingles between original URI and search result by rank
  • 17. Relevancy of Retrieval Results - Term Overlap [Figure: two histograms, “Discovered” and “Undiscovered”; x-axis: Rank (1 to 10); y-axis: Frequency; legend (term overlap): 1, > 0.75, > 0.5, > 0.0, 0] High relevancy in the top ranks with possible aliases and duplicates.
  • 18. Relevancy of Retrieval Results - Shingles [Figure: two histograms, “Discovered” and “Undiscovered”; x-axis: Rank (1 to 10); y-axis: Frequency; legend (shingle value): 1, > 0.75, > 0.5, > 0.0, 0] More optimal shingles values than top ranked URIs - possible aliases and duplicates.
  • 19. Title Evolution - Example I www.sun.com/solutions 1998-01-27 2004-02-02 Sun Software Products Selector Guides Sun Microsystems - Solutions - Solutions Tree 2004-06-10 1999-02-20 Gateway Page - Sun Solutions Sun Software Solutions 2006-01-09 2002-02-01 Sun Microsystems Solutions & Services Sun Microsystems Products 2007-01-03 2002-06-01 Services & Solutions Sun Microsystems - Business & Industry Solutions 2007-02-07 Sun Services & Solutions 2003-08-01 Sun Microsystems - Industry & 2008-01-19 Infrastructure Solutions Sun Solutions Sun Solutions 16
  • 20. Title Evolution - Example I www.sun.com/solutions 1998-01-27 2004-02-02 Sun Software Products Selector Guides Sun Microsystems - Solutions - Solutions Tree 2004-06-10 1999-02-20 Gateway Page - Sun Solutions Sun Software Solutions 2006-01-09 2002-02-01 Sun Microsystems Solutions & Services Sun Microsystems Products 2007-01-03 2002-06-01 Services & Solutions Sun Microsystems - Business & Industry Solutions 2007-02-07 Sun Services & Solutions 2003-08-01 Sun Microsystems - Industry & 2008-01-19 Infrastructure Solutions Sun Solutions Sun Solutions 16
  • 21. Title Evolution - Example II www.datacity.com/mainf.html 2002-10-16 2000-06-19 computer company in Manassas Virginia DataCity of Manassas Park Main Page sells Custom Built Computers with Removable Hard Drives Kits and 2000-10-12 Iomega 2GB Jaz Drives (jazz drives) DataCity of Manassas Park sells October 2002 DataCity 800-326-5051 Custom Built Computers & Removable toll free Hard Drives 2006-03-14 2001-08-21 Est 1989 Computer company in Stafford DataCity a computer company in Virginia sells Custom Built Secure Manassas Park sells Custom Built Computers with DoD 5200.1-R Computers & Removable Hard Drives Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB 17
  • 22. Title Evolution - Example II www.datacity.com/mainf.html 2002-10-16 2000-06-19 computer company in Manassas Virginia DataCity of Manassas Park Main Page sells Custom Built Computers with Removable Hard Drives Kits and 2000-10-12 Iomega 2GB Jaz Drives (jazz drives) DataCity of Manassas Park sells October 2002 DataCity 800-326-5051 Custom Built Computers & Removable toll free Hard Drives 2006-03-14 2001-08-21 Est 1989 Computer company in Stafford DataCity a computer company in Virginia sells Custom Built Secure Manassas Park sells Custom Built Computers with DoD 5200.1-R Computers & Removable Hard Drives Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB 17
  • 23. Title Evolution Over Time How much do titles change over time? • Copies from fixed size time windows per year • Extract available titles of past 14 years • Compute normalized Levenshtein edit distance between titles of copies and baseline (0 = identical; 1 = completely dissimilar) 18
  • 24. Title Evolution Over Time [Figure: title edit distance frequencies per year, from 2/2009 back to 2/1997; y-axis: 0 to 100; legend: Unchanged (0), Slightly Changed (0.1), then 0.2 to 1.0 in steps of 0.1] • Half the titles of available copies from recent years are (close to) identical • Decay from 2005 on (with fewer copies available) • 4 year old title: 40% chance to be unchanged
  • 25. Title Evolution Over Time Title vs Document • Y: avg shingle value for all copies per URI • X: avg edit distance of corresponding titles • overlap indicated by: green: <10 red: >90 • Semi-transparent: total amount of points plotted 20
  • 26. Title Evolution Over Time Title vs Document • Y: avg shingle value for all copies per URI • X: avg edit distance of corresponding titles • overlap indicated by: green: <10 red: >90 • Semi-transparent: total amount of points plotted [0,0] - 122 times 20
  • 27. Title Evolution Over Time Title vs Document • Y: avg shingle value for all copies per URI • X: avg edit distance of corresponding titles • overlap indicated by: green: <10 red: >90 • Semi-transparent: total amount of points plotted • [0,1] - over 1600 times • [0,0] - 122 times
  • 28. Title Performance Prediction • Quality prediction of title by • Number of nouns, articles etc. • Amount of title terms, characters ([Ntoulas]) • Observation of re-occurring terms in poorly performing titles - “Stop Titles” home, index, home page, welcome, untitled document [Ntoulas] A. Ntoulas et al. “Detecting Spam Web Pages Through Content Analysis” In Proceedings of WWW 2004, pp 83-92 21
  • 29. Title Performance Prediction • Quality prediction of title by • Number of nouns, articles etc. • Amount of title terms, characters ([Ntoulas]) • Observation of re-occurring terms in poorly performing titles - “Stop Titles”: home, index, home page, welcome, untitled document • The performance of any given title can be predicted as insufficient if 75% or more of it consists of a “Stop Title”! [Ntoulas] A. Ntoulas et al. “Detecting Spam Web Pages Through Content Analysis” In Proceedings of WWW 2004, pp 83-92
  • 30. Concluding Remarks • The “aboutness” of web pages can be determined from either the content or the title. • More than 60% of URIs are returned top ranked when using the title as a search engine query. • Titles change more slowly and less significantly over time than the web pages’ content. • Not all titles are equally good. If the majority of a title’s terms are Stop Titles, its quality can be predicted to be poor.
  • 31. Is This a Good Title? Questions? Martin Klein and Jeffery Shipman and Michael L. Nelson Old Dominion University {mklein,jshipman,mln}@cs.odu.edu
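
Slide 23 compares archived titles to a baseline using a normalized Levenshtein edit distance (0 = identical; 1 = completely dissimilar). The slides do not say how the distance is normalized or whether it is computed on characters or terms, so the following is only a minimal sketch that assumes a character-level distance divided by the length of the longer title. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def normalized_title_distance(baseline: str, copy: str) -> float:
    """Scale the distance into [0, 1]; dividing by the longer length is an assumption."""
    longest = max(len(baseline), len(copy))
    if longest == 0:
        return 0.0  # two empty titles are treated as identical
    return levenshtein(baseline, copy) / longest


# Two titles from the www.sun.com/solutions example on slide 19.
print(normalized_title_distance("Sun Services & Solutions", "Sun Solutions"))
```

Dividing by the longer length keeps the value within [0, 1], which matches the scale stated on the slide; other normalizations (for example by the sum of both lengths) would give different numbers.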

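The “Stop Title” rule from slide 29 can be sketched as a simple check. The slides give the stop-title list and the 75% threshold but not how the share is measured, so this sketch assumes it is the fraction of title terms covered by stop-title phrases. The tokenization, the handling of empty titles, and the function names are likewise assumptions for illustration only.

```python
import re

# Stop titles as listed on slide 29.
STOP_TITLES = ["home", "index", "home page", "welcome", "untitled document"]


def stop_title_ratio(title: str, stop_titles=STOP_TITLES) -> float:
    """Fraction of title terms covered by stop-title phrases (assumed measure)."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    if not tokens:
        return 1.0  # assumption: an empty title carries no retrieval value
    # Try longer phrases first so "home page" wins over "home".
    phrases = sorted((p.split() for p in stop_titles), key=len, reverse=True)
    covered = 0
    i = 0
    while i < len(tokens):
        for phrase in phrases:
            if tokens[i:i + len(phrase)] == phrase:
                covered += len(phrase)
                i += len(phrase)
                break
        else:
            i += 1  # no stop title starts at this term
    return covered / len(tokens)


def is_predicted_poor(title: str, threshold: float = 0.75) -> bool:
    """Apply the 75% rule from the slide to the assumed term-based ratio."""
    return stop_title_ratio(title) >= threshold


print(is_predicted_poor("Home Page"))
print(is_predicted_poor("Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery"))
```

A character-based share would be an equally plausible reading of the slide; the term-based version is used here only because it is easier to follow.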
Editor's notes