I presented this paper at iPres 2018. Here, we introduce the Off-Topic Memento Toolkit, used to detect versions of web pages that have drifted off topic from the general topic of a collection.
1. The Off-Topic Memento Toolkit
Shawn M. Jones, Michele C. Weigle, Michael L. Nelson
Old Dominion University
Web Science and Digital Libraries Research Group
@WebSciDL
sjone@cs.odu.edu
@shawnmjones
mweigle@cs.odu.edu
@weiglemc
mln@cs.odu.edu
@phonedude_mln
Thanks to:
2. Many Curators Use Archive-It To Create Web Archive Collections
Archive-It makes it easy for curators to build collections and supply metadata for a collection.
4. When Building A Web Archive Collection…
Curators select web resources as seeds
Each version of a seed becomes a memento
They create a web archive collection with a purpose in mind
5. When Researchers Prepare to Analyze a Web Archive Collection…
Some collections have thousands of seeds.
Remember: each seed has one or more mementos.
The sheer number of mementos to process means that researchers need to quickly identify mementos with low information value.
Off-topic mementos have low information value.
We want to identify, not delete, these mementos for further decision-making.
We identify them so that they are not selected as exemplars for storytelling.
81,014 seeds
486,227 seed mementos
7. Mementos in a Collection Can Go Off-Topic: For Technical Reasons
http://wayback.archive-it.org/1068/20130306212205/http://bo.amnesty.org/
http://wayback.archive-it.org/1068/20120303011104/http://bo.amnesty.org/
8. Mementos in a Collection Can Go Off-Topic: Page Gone
http://wayback.archive-it.org/1068/20101221161732/http://www.acdauk.org.uk/
http://wayback.archive-it.org/1068/20110902210644/http://www.acdauk.org.uk/
9. Mementos in a Collection Can Go Off-Topic: Content Drift – A Change in Languages
http://wayback.archive-it.org/1068/20130306231537/http://ecwronline.org/
http://wayback.archive-it.org/1068/20110129043404/http://ecwronline.org/
10. Mementos in a Collection Can Go Off-Topic: Server Maintenance
http://wayback.archive-it.org/1068/20111202210620/http://amnestyghana.org/
http://wayback.archive-it.org/1068/20120302232416/http://amnestyghana.org/
11. Mementos in a Collection Can Go Off-Topic: Account Suspension
http://wayback.archive-it.org/1068/20110317151735/http://amnestymauritius.org/french/news.php
http://wayback.archive-it.org/1068/20111202210625/http://amnestymauritius.org/cgi-sys/suspendedpage.cgi
12. Mementos in a Collection Can Go Off-Topic: Site Redesign
http://wayback.archive-it.org/1068/20120302224302/http://ombuds.am/main/
http://wayback.archive-it.org/1068/20100510173253/http://ombuds.am/main
13. Mementos in a Collection Can Go Off-Topic: Change in Site Ownership
http://wayback.archive-it.org/1068/20090210190543/http://www.afapredesa.org/index.php
http://wayback.archive-it.org/1068/20120302210439/http://www.afapredesa.org/
14. Mementos in a Collection Can Go Off-Topic: The Site Was Hacked
http://wayback.archive-it.org/2950/20120327032244/http://occupyevansville.org/
http://wayback.archive-it.org/2950/20120410032628/http://occupyevansville.org/
15. Mementos in a Collection Can Go Off-Topic: The Site Moves On From The Topic
http://wayback.archive-it.org/2358/20120803140009/http://www.bbc.co.uk/news/world/middle_east/
http://wayback.archive-it.org/2358/20110202225040/http://www.bbc.co.uk/news/world/middle_east/
17. The Off-Topic Memento Toolkit (OTMT)
Currently in alpha status, the OTMT:
Accepts a collection of mementos
Executes similarity measures on those mementos
Rates them as on-topic or off-topic
Identifies, but does not delete, off-topic mementos
https://github.com/oduwsdl/off-topic-memento-toolkit
21. Related Work – Similarity Measures for Documents
Manku (2007)
Sørensen (1948)
Dice (1945)
Jaccard (1912)
Simhash
Charikar (2002)
Sørensen-Dice
Coefficient
Jaccard Index
Hajishirzi
(2010)
Cosine Similarity of
TF-IDF Vectors
Cosine Similarity of
Latent Semantic
Indexing Vectors
Deerwester (1990)
OTMT supports these similarity measures
Adar (2009)
Sivakumar (2015) Řehůřek
(2011)
Like these studies, we also use these similarity measures on mementos
Content Drift in
Web Archives
Jones (2016)
Zittrain (2014)
22. @shawnmjones @WebSciDL
Related Work – Other Methods of Off-Topic Detection
22
AlNoamany
(2016)
Latent Dirichlet
Allocation
Blei (2003)
Browser
Thumbnails of
Mementos
AlSum (2012)
Off-Topic
Analysis
25. Related Work – Other Methods of Off-Topic Detection
Latent Dirichlet Allocation, Blei (2003): Topic modeling should help us find off-topic documents, but which cluster is off-topic?
Browser Thumbnails of Mementos, AlSum (2012): It is costly to manually review browser thumbnails to find off-topic mementos
Off-Topic Analysis, AlNoamany (2016): We build on AlNoamany’s work to bring you the Off-Topic Memento Toolkit
26. Memento Protocol Terminology
<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>; rel="self"; type="application/link-format"
; from="Tue, 20 Jun 2000 18:02:59 GMT"
; until="Wed, 21 Jun 2000 04:41:56 GMT",
<http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento";
datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento";
datetime="Tue, 27 Oct 2009 20:49:54 GMT",
<http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento";
datetime="Wed, 21 Jun 2000 01:17:31 GMT",
<http://arxiv.example.net/web/20000621044156/http://a.example.org>; rel="memento";
datetime="Wed, 21 Jun 2000 04:41:56 GMT"
…
Each seed, or original resource, has a corresponding TimeMap listing all of that seed’s mementos and their capture times (memento-datetimes).
URI-T: a URI for a TimeMap
URI-M: a URI for a memento
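As an illustration of how the TimeMap above could be read, here is a minimal Python sketch that pulls the URI-Ms and memento-datetimes out of link-format text. The function names `parse_timemap` and `memento_uris` are hypothetical, the regular expressions only cover the simple quoted-attribute form shown on this slide, and this is not how the OTMT itself parses TimeMaps.

```python
import re

# One link-format entry: <URI> followed by zero or more ; attr="value" pairs.
ENTRY = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return (uri, attributes) pairs for every entry in a link-format TimeMap."""
    return [(uri, dict(ATTR.findall(raw))) for uri, raw in ENTRY.findall(text)]


def memento_uris(text):
    """Return (URI-M, memento-datetime) pairs in the order they are listed."""
    return [
        (uri, attrs.get("datetime"))
        for uri, attrs in parse_timemap(text)
        # rel may hold several values, e.g. "first memento"
        if "memento" in attrs.get("rel", "").split()
    ]
```

Entries whose `rel` is `original`, `self`, or `timegate` are skipped, so only mementos are returned.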
27. Web Archives Augment Their Mementos
Banners
Rewritten Links
http://ws-dl.blogspot.com/2016/04/2016-04-27-mementos-in-raw.html
http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html
28. The OTMT Uses Raw Mementos
Raw mementos are free of these augmentations.
Archive-It and the Internet Archive provide access to raw mementos at special URIs.
The OTMT finds these raw mementos and uses them in its similarity comparisons.
http://ws-dl.blogspot.com/2016/04/2016-04-27-mementos-in-raw.html
http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html
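The slides do not spell out what the special URIs look like. A minimal sketch, assuming the Wayback convention of appending `id_` to the 14-digit timestamp in a URI-M; the function name `raw_memento_uri` is hypothetical and this is not taken from the OTMT source.

```python
import re

# A 14-digit capture timestamp between slashes, as in Wayback-style URI-Ms.
TIMESTAMP = re.compile(r"/(\d{14})/")


def raw_memento_uri(uri_m):
    """Insert the id_ modifier after the timestamp; other URIs pass through unchanged."""
    return TIMESTAMP.sub(r"/\1id_/", uri_m, count=1)


print(raw_memento_uri(
    "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/"
))
```

Archives that do not follow this convention would need their own rule, which is one reason to keep the rewrite in a single small function.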
29. The OTMT Performs Preprocessing
Original HTML:
<p class="homepage-description">The Women’s Initiatives for Gender Justice works globally to ensure justice for women and an independent and effective International Criminal Court.</p>
Boilerplate removal:
The Women’s Initiatives for Gender Justice works globally to ensure justice for women and an independent and effective International Criminal Court.
Tokenization:
['The', 'Women', '’', 's', 'Initiatives', 'for', 'Gender', 'Justice', 'works', 'globally', 'to', 'ensure', 'justice', 'for', 'women', 'and', 'an', 'independent', 'and', 'effective', 'International', 'Criminal', 'Court', '.']
Remove stop words:
['Women', '’', 'Initiatives', 'Gender', 'Justice', 'works', 'globally', 'ensure', 'justice', 'women', 'independent', 'effective', 'International', 'Criminal', 'Court']
Stemming:
['women', '’', 'initi', 'gender', 'justic', 'work', 'global', 'ensur', 'justic', 'women', 'independ', 'effect', 'intern', 'crimin', 'court']
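A rough sketch of the first preprocessing stages, using only the Python standard library. Everything here is a stand-in: tag stripping replaces real boilerplate removal, the stop word list is a tiny hypothetical one, punctuation tokens are dropped rather than kept, and stemming is left out because it needs a real stemmer.

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "s", "the", "to"}


def strip_tags(html):
    """Crude stand-in for boilerplate removal: drop anything that looks like a tag."""
    return re.sub(r"<[^>]+>", " ", html)


def tokenize(text):
    """Split text into word tokens, discarding punctuation."""
    return re.findall(r"\w+", text)


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def preprocess(html):
    return remove_stop_words(tokenize(strip_tags(html)))


html = ('<p class="homepage-description">The Women’s Initiatives for '
        'Gender Justice works globally.</p>')
print(preprocess(html))
```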
30. We Evaluated the OTMT with a Gold Standard Dataset
In “Detecting off-topic pages within TimeMaps in Web archives”, AlNoamany performed a study to detect off-topic mementos
The mementos were manually marked as on-topic or off-topic
We reuse this dataset in our evaluation
https://github.com/oduwsdl/offtopic-goldstandard-data
32. General algorithm
For each TimeMap in a collection
1. Get the first memento
2. Preprocess it
3. For each memento in the TimeMap
   1. Get the memento
   2. Preprocess it
   3. Compute the similarity to the first memento using a given measure
   4. Save the score
   5. A threshold value determines if a memento is on or off-topic
Diagram labels: First memento, Considered memento
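The loop on this slide can be sketched in Python as follows. It assumes the mementos have already been fetched and preprocessed; `rate_timemap`, `missing_fraction`, and the `higher_is_dissimilar` flag are hypothetical names for illustration, not part of the OTMT.

```python
def rate_timemap(mementos, measure, threshold, higher_is_dissimilar=True):
    """Score every memento in one TimeMap against the first memento.

    mementos: list of (uri_m, preprocessed_content) pairs in capture order.
    measure:  callable(first_content, content) returning a score.
    """
    if not mementos:
        return {}
    _, first = mementos[0]
    results = {}
    for uri_m, content in mementos:
        score = measure(first, content)
        # Distance-like measures grow with dissimilarity; similarity-like ones shrink.
        if higher_is_dissimilar:
            off_topic = score > threshold
        else:
            off_topic = score < threshold
        results[uri_m] = (score, "off-topic" if off_topic else "on-topic")
    return results


def missing_fraction(first, other):
    """Toy measure: share of the first memento's terms missing from the other."""
    first = set(first)
    return len(first - set(other)) / len(first)


timemap = [
    ("m1", ["a", "b", "c", "d"]),
    ("m2", ["a", "b", "c", "x"]),
    ("m3", ["x", "y"]),
]
print(rate_timemap(timemap, missing_fraction, threshold=0.5))
```

Passing the measure in as a callable keeps the loop identical for every measure; only the direction of the threshold comparison changes.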
33. Structural Measures – Byte Count and Word Count
On-topic: 9599 bytes, 183 words (after preprocessing)
Off-topic: 401 bytes, 22 words (after preprocessing)
Off-topic mementos tend to have fewer bytes and words
Scores range from 0 to -1
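The slide gives the score range but not the formula. One reading that fits a range of 0 to -1 is the relative change in size versus the first memento; the sketch below assumes that formula and treats the on-topic example as the first memento, so it should be read as an illustration rather than the OTMT's definition. The name `relative_change` is hypothetical.

```python
def relative_change(first_count, count):
    """Assumed score: 0.0 means the same size as the first memento, -1.0 means empty."""
    if first_count == 0:
        raise ValueError("first memento has no content to compare against")
    return (count - first_count) / first_count


# Counts from the slide: on-topic memento versus off-topic memento.
byte_score = relative_change(9599, 401)
word_score = relative_change(183, 22)
print(round(byte_score, 3), round(word_score, 3))
```

Under this assumption the off-topic example scores roughly -0.96 on byte count and -0.88 on word count, both close to the fully dissimilar end of the range.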
34. Set Operation Measures
Jaccard Distance: 1 minus the size of the intersection over the size of the union
Sørensen-Dice Distance: 1 minus twice the size of the intersection over the sum of the sizes of both sets
Scores range from 0 to 1
Words from Doc #1:
['women', '’', 'initi', 'gender', 'justic', 'current', 'work', 'uganda', 'democrat', 'republ', 'congo', 'libya']
Words from Doc #2:
['women', '’', 'initi', 'gender', 'justic', 'work', 'uganda', 'democrat', 'republ', 'congo', 'sudan', 'central', 'african', 'republ', 'kenya', 'libya', 'kyrgyzstan']
The intersection is every word that appears in both lists (highlighted on the slide)
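A short Python sketch of both set measures applied to the two word lists on this slide. Each distance is taken as 1 minus the corresponding coefficient, so 0 means identical sets and 1 means no overlap; the function names are illustrative.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|, computed on the sets of unique terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def sorensen_dice_distance(a, b):
    """1 - 2|A ∩ B| / (|A| + |B|), computed on the sets of unique terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))


doc1 = ['women', '’', 'initi', 'gender', 'justic', 'current', 'work', 'uganda',
        'democrat', 'republ', 'congo', 'libya']
doc2 = ['women', '’', 'initi', 'gender', 'justic', 'work', 'uganda', 'democrat',
        'republ', 'congo', 'sudan', 'central', 'african', 'republ', 'kenya',
        'libya', 'kyrgyzstan']

print(round(jaccard_distance(doc1, doc2), 3))        # 6/17
print(round(sorensen_dice_distance(doc1, doc2), 3))  # 6/28
```

The two documents share 11 unique terms out of 17 in the union, which gives a Jaccard distance of 6/17 (about 0.353) and a Sørensen-Dice distance of 6/28 (about 0.214).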
35. Simhash of Term Frequencies
Terms and frequencies of the two compared documents (top terms shown):
('women', 4), ('justic', 4), ('’', 3), ('gender', 3), ('initi', 2), ('intern', 2), ('icc', 2), ('work', 2), ('republ', 2), ('human', 1), …
('women', 4), ('justic', 4), ('’', 3), ('gender', 3), ('initi', 2), ('intern', 2), ('icc', 2), ('work', 2), ('human', 1), ('right', 1), …
Simhash of Terms and Frequencies from Document #1: 13221438115839111206
Simhash of Terms and Frequencies from Document #2: 13797903006343525414
Simhash Distance: 6 bits
Scores range from 0 to 64 bits
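To make the Simhash distance concrete, here is a minimal 64-bit Simhash over term frequencies with a Hamming distance in bits. The per-term hash (truncated SHA-256) is an arbitrary choice for this sketch, so its fingerprints will not match the values on the slide, and the function names are hypothetical.

```python
import hashlib


def term_hash(term, bits=64):
    """Hash a term to a `bits`-wide integer (truncated SHA-256, chosen arbitrarily)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:bits // 8], "big")


def simhash(term_frequencies, bits=64):
    """Weighted Simhash: each term votes on every bit with its frequency."""
    totals = [0] * bits
    for term, weight in term_frequencies.items():
        h = term_hash(term, bits)
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


doc1 = {"women": 4, "justic": 4, "gender": 3, "initi": 2, "republ": 2}
doc2 = {"women": 4, "justic": 4, "gender": 3, "initi": 2, "right": 1}
print(hamming_distance(simhash(doc1), simhash(doc2)))
```

Because similar inputs tend to agree on most bit votes, near-identical documents end up a few bits apart, while unrelated documents drift toward the middle of the 0 to 64 range.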
36. Simhash of raw content
Document #1:
The Women’s Initiatives for Gender Justice is an international women’s human rights organisation that advocates for gender justice through the International Criminal Court (ICC) and through domestic mechanisms, including peace negotiations and justice processes. We work with women and communities most affected by the armed conflict with a focus on countries with situations under investigation by the ICC. The Women’s Initiatives for Gender Justice currently works in Uganda, the Democratic Republic of the Congo and Libya.
Document #2:
The Women’s Initiatives for Gender Justice is an international women’s human rights organisation that advocates for gender justice through the International Criminal Court (ICC) and through domestic mechanisms, including peace negotiations and justice processes. We work with women most affected by the conflict situations under investigation by the ICC. The Women’s Initiatives for Gender Justice works in Uganda, the Democratic Republic of the Congo, Sudan, the Central African Republic, Kenya, Libya and Kyrgyzstan.
Simhash of Document #1: 12358429319379250844
Simhash of Document #2: 12359555184926328508
Simhash Distance: 6 bits
Scores range from 0 to 64 bits
37. Cosine Similarities
Take the cosine of the angle between the document vectors.
Cosine of TF-IDF
Vectors are formed from the term frequencies of each document.
Cosine of Latent Semantic Indexing (LSI)
Each vector is informed by LSI.
Scores range from 1 (equivalent) to 0 (dissimilar).
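A small, dependency-free sketch of cosine similarity over TF-IDF vectors. The smoothed IDF formula used here, log((1 + N) / (1 + df)) + 1, is one common variant and is an assumption of this example; the weighting used in the study may differ. The LSI variant is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({
            term: count * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    ["women", "justic", "court"],
    ["women", "justic", "court"],
    ["domain", "sale", "park"],
]
vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

Identical documents score 1 and documents with no terms in common score 0, matching the range stated on the slide.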
40. OTMT Usage
# detect_off_topic -i archiveit=7877 -tm jaccard=0.80,bytecount=-0.50 -o outputfile.json
Input types for -i:
• timemap – followed by 1 or more TimeMap URIs, separated by commas
• warc – followed by 1 or more WARC files, separated by commas
• archiveit – followed by an Archive-It collection ID
TimeMap measures for -tm:
• bytecount
• wordcount
• jaccard
• sorensen
• simhash-tf
• simhash-raw
• cosine
• gensim_lsi
Output types for -ot:
• json
• csv
Callouts on the command: Input (-i), Measure (-tm), Output (-o)
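To show how the `-tm` value in the command above breaks down, here is a small Python sketch that splits `jaccard=0.80,bytecount=-0.50` into measure names and thresholds. `parse_measures` is a hypothetical helper written for this example, not the OTMT's own argument parser, and treating a missing threshold as `None` is an assumption.

```python
def parse_measures(argument):
    """Turn 'jaccard=0.80,bytecount=-0.50' into {'jaccard': 0.8, 'bytecount': -0.5}."""
    measures = {}
    for item in argument.split(","):
        name, _, threshold = item.partition("=")
        # Assumption: a measure given without "=value" has no explicit threshold.
        measures[name.strip()] = float(threshold) if threshold else None
    return measures


print(parse_measures("jaccard=0.80,bytecount=-0.50"))
```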
42. OTMT Output - JSON
"http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
"http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
"timemap measures": {
"cosine": {
"stemmed": true,
"tokenized": true,
"removed boilerplate": true,
"comparison score": 0.10969941307631487,
"topic status": "off-topic"
},
"bytecount": {
"stemmed": false,
"tokenized": false,
"removed boilerplate": false,
"comparison score": 0.15971409055425445,
"topic status": "on-topic"
}
},
"overall topic status": "off-topic"
}, ...
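Callouts:
Outer key: URI-T of TimeMap
Inner key: URI-M of memento
"timemap measures": measure information
"stemmed", "tokenized", "removed boilerplate": preprocessing status
"comparison score": measure score
"topic status": on-topic or off-topic status by measure
"overall topic status": on-topic or off-topic status overall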
If one measure scores as off-topic, the memento is considered off-topic
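A sketch of how a consumer might read this output and list the off-topic mementos together with the measures that flagged them. It relies only on the keys visible in the excerpt above; the `sample` string is a trimmed, completed copy of that excerpt and `off_topic_mementos` is a hypothetical helper.

```python
import json

sample = """
{
  "http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
    "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
      "timemap measures": {
        "cosine": {"comparison score": 0.10969941307631487, "topic status": "off-topic"},
        "bytecount": {"comparison score": 0.15971409055425445, "topic status": "on-topic"}
      },
      "overall topic status": "off-topic"
    }
  }
}
"""


def off_topic_mementos(otmt_output):
    """Yield (URI-T, URI-M, measures that rated it off-topic) per off-topic memento."""
    for uri_t, mementos in otmt_output.items():
        for uri_m, record in mementos.items():
            if record.get("overall topic status") != "off-topic":
                continue
            flagged_by = [
                name
                for name, info in record.get("timemap measures", {}).items()
                if info.get("topic status") == "off-topic"
            ]
            yield uri_t, uri_m, flagged_by


for uri_t, uri_m, flagged_by in off_topic_mementos(json.loads(sample)):
    print(uri_m, flagged_by)
```

In the sample, cosine rates the memento off-topic and byte count rates it on-topic, so the overall status is off-topic, as the rule on this slide states.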
43. Supported Similarity Measures
Measure | Fully Equivalent Score | Fully Dissimilar Score | Preprocessing Performed | OTMT -tm keyword
Byte Count | 0.0 | -1.0 | No | bytecount
Word Count | 0.0 | -1.0 | Yes | wordcount
Jaccard Distance | 0.0 | 1.0 | Yes | jaccard
Sørensen-Dice | 0.0 | 1.0 | Yes | sorensen
Simhash of Term Frequencies | 0 | 64 | Yes | simhash-tf
Simhash of raw memento | 0 | 64 | No | simhash-raw
Cosine Similarity of TF-IDF Vectors | 1.0 | 0 | Yes | cosine
Cosine Similarity of LSI Vectors | 1.0 | 0 | Yes | gensim_lsi
45. Experiment setup
For each measure:
1. Start the threshold at the score of complete dissimilarity
2. Test with the URI-Ms from the gold standard data set as if that threshold indicated off-topic
3. Compute F1 using the real off-topic status of each memento from the gold standard data
4. Increment the threshold
5. Repeat 2 – 4 until the threshold matches the score of complete equivalence
Example using Byte Count:
1. Start threshold at -1
2. Test with the URI-Ms from the gold standard data set as if -1 indicated off-topic
3. Compute F1 using the real off-topic status of each memento from the gold standard data
4. Increment the threshold to -0.99
5. Test with the URI-Ms from the gold standard data set as if -0.99 indicated off-topic
6. Compute F1 with real status
7. Increment to -0.98
8. Repeat until the threshold is 0
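The byte count example can be sketched as a threshold sweep in Python. The data below is made up for illustration, and the rule that a score at or below the threshold counts as off-topic is an assumption about the direction of the comparison; `f1_score` and `sweep_thresholds` are hypothetical names.

```python
def f1_score(predicted, actual):
    """F1 for the off-topic class, given parallel lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep_thresholds(scores, gold_off_topic, start=-1.0, stop=0.0, steps=100):
    """Try every threshold from start to stop and return (best F1, its threshold)."""
    best = (0.0, start)
    for i in range(steps + 1):
        # Derive each threshold from the step index to avoid accumulating float error.
        threshold = start + (stop - start) * i / steps
        predicted = [score <= threshold for score in scores]
        f1 = f1_score(predicted, gold_off_topic)
        if f1 > best[0]:
            best = (f1, threshold)
    return best


# Made-up byte count scores and gold standard labels (True means off-topic).
scores = [-0.955, -0.905, -0.10, 0.05, -0.02]
gold = [True, True, False, False, False]
print(sweep_thresholds(scores, gold))
```

With 100 steps between -1 and 0 the sweep moves in increments of 0.01, matching the -1, -0.99, -0.98 progression in the example.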
46. Our results do not match AlNoamany’s, but the world is not the same as it was in 2015…
 | AlNoamany’s Study | Our Study
Year Conducted | 2015 | 2017
Boilerplate Removal | Boilerpipe (Java) | Justext
Tokenization and Stemming | Scikit-learn | NLTK
Other changes:
• Download errors
• Gold Standard Dataset updates
53. Cosine Similarity of TF-IDF Vectors
Our Results compared with AlNoamany’s Results
AlNoamany’s best F1 score: 0.881 at threshold 0.15 (the best score in AlNoamany’s results)
55. Results Summarized – Best F1 Score is Word Count
Similarity Measure | AlNoamany's Results: Best F1 Score | Corresponding Accuracy | Corresponding Threshold | Results of this study: Best F1 Score | Corresponding Accuracy | Corresponding Threshold
Word Count | 0.806 | 0.982 | -0.85 | 0.788 | 0.971 | -0.7
Cosine Similarity of TF-IDF Vectors | 0.881 | 0.983 | 0.15 | 0.766 | 0.965 | 0.12
Byte Count | 0.584 | 0.962 | -0.65 | 0.756 | 0.965 | -0.39
Cosine Similarity of LSI Vectors | Not tested | Not tested | Not tested | 0.711 | 0.965 | 0.12 with 10 topics
Jaccard Distance | 0.538 | 0.962 | 0.95 | 0.651 | 0.953 | 0.94
Sørensen-Dice Distance | Not tested | Not tested | Not tested | 0.649 | 0.953 | 0.88
Simhash on raw memento content | Not tested | Not tested | Not tested | 0.578 | 0.934 | 25
Simhash on TF | Not tested | Not tested | Not tested | 0.523 | 0.942 | 28
In our results, word count had the best F1 score (0.788), ahead of cosine similarity of TF-IDF vectors (0.766).
AlNoamany’s cosine similarity measure (0.881) came out ahead of ours (0.766).
56. What about using measures together?
AlNoamany found that using cosine similarity of TF-IDF vectors and word count together produced even better results.
Our best F1 score for word count alone was 0.788.
Word count combined with LSI turned out to be slightly better, with the same accuracy.
The success of word count appears to exert influence on the threshold of its partner measure, making that threshold more strict.
58. Improving the OTMT
Bug fixes
Make LSI scores reproducible
New Measures
TimeMap Measures – compare first memento with considered memento:
Spamsum of the raw content – used by Andy Jackson at the UKWA
Cosine of LDA Vectors via Gensim
Collection Measures
1. Develop a collection-wide picture
2. Compare each memento against that picture
Control over preprocessing:
Options to use a different boilerplate removal method
Options to turn off stemming or stop word removal
60. Motivation - Mementos Can Go Off-Topic
Collections have a theme
Seeds are selected to support that theme
Mementos are versions of seeds
Some of these versions are off-topic
Identifying these off-topic mementos is key to some research activities, like summarization
Examples shown: Hacked, Moved on from topic, Web Page Gone, Account Suspension
61. OTMT supports different similarity measures with thresholds established based on experimentation
Byte count
Word count
Jaccard distance
Sørensen-Dice distance
Simhash of term frequencies
Simhash of raw memento content
Cosine similarity of TF-IDF vectors
Cosine similarity of LSI vectors