The Off-Topic Memento Toolkit
Shawn M. Jones, Michele C. Weigle, Michael L. Nelson
Old Dominion University
Web Science and Digital Libraries Research Group (@WebSciDL)
Shawn M. Jones: sjone@cs.odu.edu, @shawnmjones
Michele C. Weigle: mweigle@cs.odu.edu, @weiglemc
Michael L. Nelson: mln@cs.odu.edu, @phonedude_mln
@shawnmjones @WebSciDL
Many Curators Use Archive-It To Create Web Archive Collections
Archive-It makes it easy for curators to build collections and supply metadata for a collection.
When Building A Web Archive Collection…
• Curators select web resources as seeds
• Each version of a seed becomes a memento
• They create a web archive collection with a purpose in mind
When Researchers Prepare to Analyze a Web Archive Collection…
Some collections have thousands of seeds. Remember: each seed has one or more mementos.
Example: 81,014 seeds with 486,227 seed mementos.
The sheer number of mementos to process means that researchers need to quickly identify mementos with low information value.
Off-topic mementos have low information value.
We want to identify, not delete, these mementos for further decision-making.
We identify them so that they are not selected as exemplars for storytelling.
How Can Mementos Go Off-Topic?
Mementos in a Collection Can Go Off-Topic:

For Technical Reasons
http://wayback.archive-it.org/1068/20130306212205/http://bo.amnesty.org/
http://wayback.archive-it.org/1068/20120303011104/http://bo.amnesty.org/

Page Gone
http://wayback.archive-it.org/1068/20101221161732/http://www.acdauk.org.uk/
http://wayback.archive-it.org/1068/20110902210644/http://www.acdauk.org.uk/

Content Drift: A Change in Languages
http://wayback.archive-it.org/1068/20130306231537/http://ecwronline.org/
http://wayback.archive-it.org/1068/20110129043404/http://ecwronline.org/

Server Maintenance
http://wayback.archive-it.org/1068/20111202210620/http://amnestyghana.org/
http://wayback.archive-it.org/1068/20120302232416/http://amnestyghana.org/

Account Suspension
http://wayback.archive-it.org/1068/20110317151735/http://amnestymauritius.org/french/news.php
http://wayback.archive-it.org/1068/20111202210625/http://amnestymauritius.org/cgi-sys/suspendedpage.cgi

Site Redesign
http://wayback.archive-it.org/1068/20120302224302/http://ombuds.am/main/
http://wayback.archive-it.org/1068/20100510173253/http://ombuds.am/main

Change of Site Ownership
http://wayback.archive-it.org/1068/20090210190543/http://www.afapredesa.org/index.php
http://wayback.archive-it.org/1068/20120302210439/http://www.afapredesa.org/

The Site Was Hacked
http://wayback.archive-it.org/2950/20120327032244/http://occupyevansville.org/
http://wayback.archive-it.org/2950/20120410032628/http://occupyevansville.org/

The Site Moves On From The Topic
http://wayback.archive-it.org/2358/20120803140009/http://www.bbc.co.uk/news/world/middle_east/
http://wayback.archive-it.org/2358/20110202225040/http://www.bbc.co.uk/news/world/middle_east/
Presenting the Off-Topic Memento Toolkit (OTMT)
a tool for identifying these off-topic mementos
The Off-Topic Memento Toolkit (OTMT)
• Currently in alpha status, the OTMT:
  • Accepts a collection of mementos
  • Executes similarity measures on those mementos
  • Rates them as on-topic or off-topic
  • Identifies, does not delete, off-topic mementos
https://github.com/oduwsdl/off-topic-memento-toolkit
Background and Related Work
Related Work – Similarity Measures for Documents
Similarity measures (OTMT supports these similarity measures):
• Simhash
• Sørensen-Dice Coefficient
• Jaccard Index
• Cosine Similarity of TF-IDF Vectors
• Cosine Similarity of Latent Semantic Indexing Vectors
Cited work: Jaccard (1912), Dice (1945), Sorensen (1948), Deerwester (1990), Charikar (2002), Manku (2007), Adar (2009), Hajishirzi (2010), Řehůřek (2011), Sivakumar (2015)
Content Drift in Web Archives: Zittrain (2014), Jones (2016)
Like these studies, we also use these similarity measures on mementos
Related Work – Other Methods of Off-Topic Detection
• Latent Dirichlet Allocation, Blei (2003): Topic modeling should help us find off-topic documents, but which cluster is off-topic?
• Browser Thumbnails of Mementos, AlSum (2012): It is costly to manually review browser thumbnails to find off-topic mementos
• Off-Topic Analysis, AlNoamany (2016): We build on AlNoamany’s work to bring you the Off-Topic Memento Toolkit
Memento Protocol Terminology
<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>; rel="self"; type="application/link-format"
; from="Tue, 20 Jun 2000 18:02:59 GMT"
; until="Wed, 21 Jun 2000 04:41:56 GMT",
<http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento";
datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento";
datetime="Tue, 27 Oct 2009 20:49:54 GMT",
<http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento";
datetime="Wed, 21 Jun 2000 01:17:31 GMT",
<http://arxiv.example.net/web/20000621044156/http://a.example.org>; rel="memento";
datetime="Wed, 21 Jun 2000 04:41:56 GMT"
…
Each seed, or original resource, has a corresponding TimeMap listing all of that seed’s mementos and their capture times, called memento-datetimes.
URI-T: a URI for a TimeMap
URI-M: a URI for a memento
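A TimeMap in link format like the one above can be read with a few lines of code. The sketch below is illustrative only and is not the OTMT's own parser: the name `parse_timemap` and both regular expressions are assumptions, and it handles only well-formed entries such as those shown on this slide.

```python
import re
from datetime import datetime

# Hypothetical helper, not part of the OTMT: pull (URI-M, memento-datetime)
# pairs out of a link-format TimeMap.
ENTRY = re.compile(r'<([^>]+)>\s*;([^<]*)')   # <uri>; followed by its parameters
ATTR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')   # name="value" pairs

def parse_timemap(text):
    mementos = []
    for uri, params in ENTRY.findall(text):
        attrs = dict(ATTR.findall(params))
        rels = attrs.get("rel", "").split()
        # Only entries whose rel includes "memento" are captures of the seed.
        if "memento" in rels and "datetime" in attrs:
            when = datetime.strptime(attrs["datetime"],
                                     "%a, %d %b %Y %H:%M:%S GMT")
            mementos.append((uri, when))
    # Oldest capture first, so the first memento is at index 0.
    return sorted(mementos, key=lambda pair: pair[1])

SAMPLE = '''<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>; rel="self"; type="application/link-format"
; from="Tue, 20 Jun 2000 18:02:59 GMT"
; until="Wed, 21 Jun 2000 04:41:56 GMT",
<http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento";
datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento";
datetime="Tue, 27 Oct 2009 20:49:54 GMT"'''

for urim, when in parse_timemap(SAMPLE):
    print(when.isoformat(), urim)
```

The original, self, and timegate entries are skipped because their `rel` values do not include `memento`.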
Web Archives Augment Their Mementos
Augmentations: banners and rewritten links
http://ws-dl.blogspot.com/2016/04/2016-04-27-mementos-in-raw.html
http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html
The OTMT Uses Raw Mementos
Raw mementos are free of these augmentations.
Archive-It and the Internet Archive provide access to raw mementos at special URIs.
The OTMT finds these raw mementos and uses them in its similarity comparisons.
http://ws-dl.blogspot.com/2016/04/2016-04-27-mementos-in-raw.html
http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html
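One way to picture the "special URIs" is a rewrite of the URI-M. The sketch below assumes the common Wayback convention in which `id_` placed directly after the 14-digit timestamp requests the raw capture. The helper name `raw_urim` is hypothetical, and this is not a description of how the OTMT itself locates raw mementos.

```python
import re

# Assumption: the archive serves the raw memento when "id_" follows the
# 14-digit timestamp in the URI-M.
TIMESTAMP = re.compile(r'/(\d{14})/')

def raw_urim(urim):
    """Return a URI-M variant that asks for the raw memento (hypothetical helper)."""
    # Only the first timestamp segment is rewritten; a URI-M without a
    # bare 14-digit timestamp segment is returned unchanged.
    return TIMESTAMP.sub(lambda m: "/" + m.group(1) + "id_/", urim, count=1)

print(raw_urim("http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/"))
```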
The OTMT Performs Preprocessing
Raw memento content:
<p class="homepage-description">The Women’s Initiatives for Gender Justice works globally to ensure justice for women and an independent and effective International Criminal Court.</p>
Boilerplate removal:
The Women’s Initiatives for Gender Justice works globally to ensure justice for women and an independent and effective International Criminal Court.
Tokenization:
['The', 'Women', '’', 's', 'Initiatives', 'for', 'Gender', 'Justice', 'works', 'globally', 'to', 'ensure', 'justice', 'for', 'women', 'and', 'an', 'independent', 'and', 'effective', 'International', 'Criminal', 'Court', '.']
Remove stop words:
['Women', '’', 'Initiatives', 'Gender', 'Justice', 'works', 'globally', 'ensure', 'justice', 'women', 'independent', 'effective', 'International', 'Criminal', 'Court']
Stemming:
['women', '’', 'initi', 'gender', 'justic', 'work', 'global', 'ensur', 'justic', 'women', 'independ', 'effect', 'intern', 'crimin', 'court']
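The tokenization and stop-word steps can be sketched with the standard library alone. This is a simplified stand-in rather than the OTMT's pipeline: the stop list is a tiny illustrative set, the function names are hypothetical, and boilerplate removal and stemming (which the slides attribute to Justext and NLTK) are left out.

```python
import re

# Tiny illustrative stop list; a real run would use a full list.
STOP_WORDS = {"the", "s", "for", "to", "and", "an", "a", "of", "in"}

def tokenize(text):
    # Runs of word characters become tokens; every other non-space
    # character (such as the apostrophe) becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stop_words(tokens):
    # Comparison is case-insensitive; the original casing is kept.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = tokenize("The Women’s Initiatives for Gender Justice")
print(tokens)
print(remove_stop_words(tokens))
```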
We Evaluated the OTMT with a Gold Standard Dataset
• In “Detecting off-topic pages within TimeMaps in Web archives”, AlNoamany performed a study to detect off-topic mementos
• The mementos were manually marked as on-topic or off-topic
• We reuse this dataset in our evaluation
https://github.com/oduwsdl/offtopic-goldstandard-data
TimeMap Measures Supported by OTMT
General algorithm
• For each TimeMap in a collection:
  1. Get the first memento
  2. Preprocess it
  3. For each memento in the TimeMap:
     1. Get the memento
     2. Preprocess it
     3. Compute the similarity to the first memento using a given measure
     4. Save the score
     5. A threshold value determines if a memento is on-topic or off-topic
Each considered memento is compared with the first memento of its TimeMap.
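The steps of the general algorithm can be sketched as a small loop. Everything here is a hypothetical illustration rather than OTMT code: `score_timemap`, `label`, and the toy preprocessing and measure functions are assumed names, and fetching mementos is replaced by an in-memory list.

```python
def score_timemap(mementos, preprocess, measure):
    """Score each memento of one TimeMap against the first memento.

    mementos: list of (urim, content) pairs in TimeMap order.
    preprocess, measure: caller-supplied functions (hypothetical).
    """
    if not mementos:
        return {}
    first = preprocess(mementos[0][1])
    scores = {}
    for urim, content in mementos:
        # Compare the considered memento with the first memento.
        scores[urim] = measure(first, preprocess(content))
    return scores

def label(scores, is_off_topic):
    # The threshold test is passed in because its direction differs by measure.
    return {urim: ("off-topic" if is_off_topic(score) else "on-topic")
            for urim, score in scores.items()}

# Toy inputs: whitespace tokenization and a word-overlap ratio.
timemap = [("m1", "human rights news"),
           ("m2", "human rights news today"),
           ("m3", "account suspended")]
to_words = lambda text: set(text.split())
overlap = lambda a, b: len(a & b) / len(a | b)

scores = score_timemap(timemap, to_words, overlap)
print(label(scores, lambda s: s < 0.5))
```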
Structural Measures – Byte Count and Word Count
On-topic: 9599 bytes, 183 words (after preprocessing)
Off-topic: 401 bytes, 22 words (after preprocessing)
Off-topic mementos tend to have fewer bytes and words
Scores range from 0 (fully equivalent) to -1 (fully dissimilar)
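A formula consistent with the stated score range is the change in count relative to the first memento. The slides do not spell out the formula, so the sketch below is an assumption, as is treating the on-topic example as the first memento and the off-topic example as the considered one.

```python
def relative_change(first_count, considered_count):
    # Assumed formula: (considered - first) / first.
    # Identical counts give 0.0; an empty considered memento gives -1.0.
    if first_count == 0:
        raise ValueError("the first memento has no content to compare against")
    return (considered_count - first_count) / first_count

print(round(relative_change(9599, 401), 2))   # byte count example
print(round(relative_change(183, 22), 2))     # word count example
```

Under this assumed formula a considered memento that is larger than the first memento scores above 0.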
Set Operation Measures
Jaccard Distance: one minus the size of the intersection over the size of the union
Sørensen-Dice Distance: one minus twice the size of the intersection over the sum of the sizes of both sets
Scores range from 0 (fully equivalent) to 1 (fully dissimilar)
Words from Doc #1:
['women', '’', 'initi', 'gender', 'justic', 'current', 'work', 'uganda', 'democrat', 'republ', 'congo', 'libya']
Words from Doc #2:
['women', '’', 'initi', 'gender', 'justic', 'work', 'uganda', 'democrat', 'republ', 'congo', 'sudan', 'central', 'african', 'republ', 'kenya', 'libya', 'kyrgyzstan']
The intersection is every word from Doc #1 except 'current'.
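Both set measures can be worked through on the two word lists from this slide. The sketch below uses the standard definitions of the Jaccard and Sørensen-Dice distances on sets of unique terms. The function names are illustrative and are not taken from the OTMT source.

```python
def jaccard_distance(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty documents are treated as identical
    return 1.0 - len(a & b) / len(a | b)

def sorensen_dice_distance(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))

DOC1 = ['women', '’', 'initi', 'gender', 'justic', 'current', 'work',
        'uganda', 'democrat', 'republ', 'congo', 'libya']
DOC2 = ['women', '’', 'initi', 'gender', 'justic', 'work', 'uganda',
        'democrat', 'republ', 'congo', 'sudan', 'central', 'african',
        'republ', 'kenya', 'libya', 'kyrgyzstan']

# 11 shared terms, 17 terms in the union, 12 and 16 unique terms per document.
print(round(jaccard_distance(DOC1, DOC2), 4))        # 1 - 11/17
print(round(sorensen_dice_distance(DOC1, DOC2), 4))  # 1 - 22/28
```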
Simhash of Term Frequencies
Terms and frequencies from Document #1:
('women', 4), ('justic', 4), ('’', 3), ('gender', 3), ('initi', 2), ('intern', 2), ('icc', 2), ('work', 2), ('republ', 2), ('human', 1), …
Simhash of terms and frequencies from Document #1: 13221438115839111206
Terms and frequencies from Document #2:
('women', 4), ('justic', 4), ('’', 3), ('gender', 3), ('initi', 2), ('intern', 2), ('icc', 2), ('work', 2), ('human', 1), ('right', 1), …
Simhash of terms and frequencies from Document #2: 13797903006343525414
Simhash Distance: 6 bits
Scores range from 0 to 64 bits
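The Simhash distance is the number of bit positions in which two 64-bit fingerprints differ. The sketch below builds a fingerprint from (term, frequency) pairs and compares two of them. Using MD5 as the per-term hash is an assumption made for the example, so its fingerprints will not match the values on the slide, and the function names are hypothetical.

```python
import hashlib

BITS = 64

def simhash(weighted_terms):
    """Compute a 64-bit Simhash from (term, weight) pairs."""
    totals = [0] * BITS
    for term, weight in weighted_terms:
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")   # first 64 bits of the hash
        for i in range(BITS):
            # A set bit votes +weight for position i, a clear bit votes -weight.
            totals[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(BITS) if totals[i] > 0)

def simhash_distance(x, y):
    # Hamming distance: count the bits that differ.
    return bin(x ^ y).count("1")

doc1 = [("women", 4), ("justic", 4), ("gender", 3), ("republ", 2)]
doc2 = [("women", 4), ("justic", 4), ("gender", 3), ("right", 1)]
print(simhash_distance(simhash(doc1), simhash(doc2)))
```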
Simhash of raw content
Document #1:
The Women’s Initiatives for Gender Justice is an international women’s human rights organisation that advocates for gender justice through the International Criminal Court (ICC) and through domestic mechanisms, including peace negotiations and justice processes. We work with women and communities most affected by the armed conflict with a focus on countries with situations under investigation by the ICC. The Women’s Initiatives for Gender Justice currently works in Uganda, the Democratic Republic of the Congo and Libya.
Simhash of Document #1: 12358429319379250844
Document #2:
The Women’s Initiatives for Gender Justice is an international women’s human rights organisation that advocates for gender justice through the International Criminal Court (ICC) and through domestic mechanisms, including peace negotiations and justice processes. We work with women most affected by the conflict situations under investigation by the ICC. The Women’s Initiatives for Gender Justice works in Uganda, the Democratic Republic of the Congo, Sudan, the Central African Republic, Kenya, Libya and Kyrgyzstan.
Simhash of Document #2: 12359555184926328508
Simhash Distance: 6 bits
Scores range from 0 to 64 bits
Cosine Similarities
Take the cosine of the angle between the document vectors.
Cosine of TF-IDF: vectors are formed from each document and its term frequencies.
Cosine of Latent Semantic Indexing (LSI): each vector is informed by LSI.
Scores range from 1 (fully equivalent) to 0 (fully dissimilar).
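The cosine computation itself is short. The sketch below applies it to raw term-count vectors as a simplified stand-in, so the IDF weighting and the LSI projection are deliberately left out, and the function name is illustrative.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares nothing with any other
    return dot / (norm_a * norm_b)

print(cosine_similarity(["women", "justic", "gender"], ["women", "justic", "court"]))
```

Because term counts are never negative, this score stays between 0 and 1, which matches the range stated on the slide.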
Using the OTMT
OTMT Installation Options
1. pip from PyPI (preferred):
   pip install otmt
2. Experimental Docker image:
   docker pull shawnmjones/otmt
3. Source code:
   git clone https://github.com/oduwsdl/off-topic-memento-toolkit.git
OTMT Usage
# detect_off_topic -i archiveit=7877 -tm jaccard=0.80,bytecount=-0.50 -o outputfile.json
In this command, -i names the input, -tm names the measures, and -o names the output file.
Input types for -i:
• timemap – followed by 1 or more TimeMap URIs, separated by commas
• warc – followed by 1 or more WARC files, separated by commas
• archiveit – followed by an Archive-It collection ID
TimeMap measures for -tm:
• bytecount
• wordcount
• jaccard
• sorensen
• simhash-tf
• simhash-raw
• cosine
• gensim_lsi
Output types for -ot:
• json
• csv
OTMT Output - JSON
"http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
    "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
        "timemap measures": {
            "cosine": {
                "stemmed": true,
                "tokenized": true,
                "removed boilerplate": true,
                "comparison score": 0.10969941307631487,
                "topic status": "off-topic"
            },
            "bytecount": {
                "stemmed": false,
                "tokenized": false,
                "removed boilerplate": false,
                "comparison score": 0.15971409055425445,
                "topic status": "on-topic"
            }
        },
        "overall topic status": "off-topic"
    }, ...
Reading the output:
• The outer key is the URI-T of the TimeMap
• The next key is the URI-M of the memento
• Each entry under "timemap measures" holds the measure information: preprocessing status, measure score, and on-topic or off-topic status by measure
• "overall topic status" is the on-topic or off-topic status overall
If one measure scores as off-topic, the memento is considered off-topic
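The JSON output can be post-processed to list the off-topic mementos and the measures that flagged them. The sketch below assumes the complete file is one object keyed by URI-T and then URI-M, as in the fragment on this slide. The key names come from that fragment and the function name is hypothetical.

```python
import json

def off_topic_mementos(results):
    """Return (URI-M, [measures that flagged it]) for each off-topic memento."""
    found = []
    for urit, mementos in results.items():
        for urim, info in mementos.items():
            if info.get("overall topic status") != "off-topic":
                continue
            flagged = [name
                       for name, detail in info.get("timemap measures", {}).items()
                       if detail.get("topic status") == "off-topic"]
            found.append((urim, flagged))
    return found

# In practice the data would come from a file, e.g. json.load(open("outputfile.json")).
SAMPLE = json.loads('''{
  "http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
    "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
      "timemap measures": {
        "cosine": {"comparison score": 0.10969941307631487, "topic status": "off-topic"},
        "bytecount": {"comparison score": 0.15971409055425445, "topic status": "on-topic"}
      },
      "overall topic status": "off-topic"
    }
  }
}''')

for urim, measures in off_topic_mementos(SAMPLE):
    print(urim, measures)
```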
Supported Similarity Measures
Measure                               Fully Equivalent  Fully Dissimilar  Preprocessing  OTMT -tm
                                      Score             Score             Performed      keyword
Byte Count                            0.0               -1.0              No             bytecount
Word Count                            0.0               -1.0              Yes            wordcount
Jaccard Distance                      0.0               1.0               Yes            jaccard
Sørensen-Dice                         0.0               1.0               Yes            sorensen
Simhash of Term Frequencies           0                 64                Yes            simhash-tf
Simhash of raw memento                0                 64                No             simhash-raw
Cosine Similarity of TF-IDF Vectors   1.0               0                 Yes            cosine
Cosine Similarity of LSI Vectors      1.0               0                 Yes            gensim_lsi
Establishing Reasonable Defaults
Experiment setup
• For each measure:
  1. Start the threshold at the score of complete dissimilarity
  2. Test with the URI-Ms from the gold standard dataset as if that threshold indicated off-topic
  3. Compute F1 using the real off-topic status of each memento from the gold standard data
  4. Increment the threshold
  5. Repeat 2 – 4 until the threshold matches the score of complete equivalence
• Example using Byte Count:
  1. Start the threshold at -1
  2. Test with the URI-Ms from the gold standard dataset as if -1 indicated off-topic
  3. Compute F1 using the real off-topic status of each memento from the gold standard data
  4. Increment the threshold to -0.99
  5. Test with the URI-Ms from the gold standard dataset as if -0.99 indicated off-topic
  6. Compute F1 with the real status
  7. Increment to -0.98
  8. Repeat until the threshold is 0
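The threshold sweep described above can be sketched as follows. The toy scores and labels are made up for illustration and are not gold standard data. The strict `<` comparison used for byte count and all function names are assumptions.

```python
def f1_score(predicted, actual):
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, actual_off_topic, start, stop, step, off_topic_if):
    """Sweep from complete dissimilarity (start) to complete equivalence (stop)."""
    sign = 1 if stop >= start else -1
    steps = round(abs(stop - start) / step)
    best_t, best_f1 = start, 0.0
    for i in range(steps + 1):
        threshold = round(start + sign * i * step, 10)  # avoid float drift
        predicted = [off_topic_if(s, threshold) for s in scores]
        f1 = f1_score(predicted, actual_off_topic)
        if f1 > best_f1:  # keep the first threshold that reaches the best F1
            best_t, best_f1 = threshold, f1
    return best_t, best_f1

# Toy byte count scores: a score below the threshold is rated off-topic.
toy_scores = [-0.9, -0.8, -0.1, 0.0]
toy_labels = [True, True, False, False]
print(best_threshold(toy_scores, toy_labels, -1.0, 0.0, 0.01,
                     lambda score, t: score < t))
```

For distance measures such as Jaccard, the same function could be called with `start=1.0`, `stop=0.0`, and a comparison in the other direction.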
Our results do not match AlNoamany’s, but the world is not the same as it was in 2015…
                            AlNoamany’s Study   Our Study
Year Conducted              2015                2017
Boilerplate Removal         Boilerpipe (Java)   Justext
Tokenization and Stemming   Scikit-learn        NLTK
Other changes:
• Download errors
• Gold Standard Dataset updates
Simhash of Term Frequencies
Our results: [figure]
AlNoamany’s results: not tested

Simhash of raw memento
Our results: [figure]
AlNoamany’s results: not tested

Sørensen-Dice Distance Results
Our results: [figure]
AlNoamany’s results: not tested

Jaccard Distance Results
Our results: [figure]
AlNoamany’s results: best F1 score 0.538 at threshold 0.95

Cosine Similarity of LSI Vectors
Our results: [figure]
Note: LSI scores are non-deterministic
AlNoamany’s results: not tested

Byte Count Results
Our results: [figure]
AlNoamany’s results: best F1 score 0.584 at threshold -0.65

Cosine Similarity of TF-IDF Vectors
Our results: [figure]
AlNoamany’s results: best F1 score 0.881 at threshold 0.15 (best score in AlNoamany’s results)

Word Count Results
Our results: [figure] (best score in our results)
AlNoamany’s results: best F1 score 0.806 at threshold -0.85
Results Summarized – Best F1 Score is Word Count
                                      AlNoamany's Results              Results of this study
Similarity Measure                    Best F1  Accuracy  Threshold     Best F1  Accuracy  Threshold
Word Count                            0.806    0.982     -0.85         0.788    0.971     -0.7
Cosine Similarity of TF-IDF Vectors   0.881    0.983     0.15          0.766    0.965     0.12
Byte Count                            0.584    0.962     -0.65         0.756    0.965     -0.39
Cosine Similarity of LSI Vectors      Not tested                       0.711    0.965     0.12 with 10 topics
Jaccard Distance                      0.538    0.962     0.95          0.651    0.953     0.94
Sørensen-Dice Distance                Not tested                       0.649    0.953     0.88
Simhash on raw memento content        Not tested                       0.578    0.934     25
Simhash on TF                         Not tested                       0.523    0.942     28
Accuracy and threshold are the values corresponding to the best F1 score.
In our study, word count produced the best F1 score (0.788).
In AlNoamany’s study, cosine similarity of TF-IDF vectors produced the best F1 score (0.881), ahead of our 0.766 for the same measure.
What about using measures together?
AlNoamany found that using cosine similarity of TF-IDF vectors and word count together produced even better results.
Our best F1 score for word count alone was 0.788.
Word count combined with LSI turned out to be slightly better, with the same accuracy.
The success of word count appears to influence the threshold of its partner measure, making that threshold stricter.
The Future of OTMT
Improving the OTMT
• Bug fixes
• Make LSI scores reproducible
• New measures
  • TimeMap measures, which compare the first memento with the considered memento:
    • Spamsum of the raw content, used by Andy Jackson at the UKWA
    • Cosine of LDA vectors via Gensim
  • Collection measures:
    1. Develop a collection-wide picture
    2. Compare each memento against that picture
• Control over preprocessing:
  • Options to use a different boilerplate removal method
  • Options to turn off stemming or stop word removal
Conclusion
Motivation - Mementos Can Go Off-Topic
• Collections have a theme
• Seeds are selected to support that theme
• Mementos are versions of seeds
• Some of these versions are off-topic
• Identifying these off-topic mementos is key to some research activities, like summarization
Examples: web page gone, account suspension, hacked, moved on from topic
OTMT supports different similarity measures with thresholds established based on experimentation
• Byte count
• Word count
• Jaccard distance
• Sørensen-Dice distance
• Simhash of term frequencies
• Simhash of raw memento content
• Cosine similarity of TF-IDF vectors
• Cosine similarity of LSI vectors
Please try out the Off-Topic Memento Toolkit!
1. pip (preferred):
   pip install otmt
2. Experimental Docker image:
   docker pull shawnmjones/otmt
3. Source code:
   git clone https://github.com/oduwsdl/off-topic-memento-toolkit.git
https://github.com/oduwsdl/off-topic-memento-toolkit
https://github.com/oduwsdl/offtopic-goldstandard-data

More Related Content

What's hot

Where Can We Post Stories Summarizing Web Archive Collections
Where Can We Post Stories Summarizing Web Archive CollectionsWhere Can We Post Stories Summarizing Web Archive Collections
Where Can We Post Stories Summarizing Web Archive CollectionsShawn Jones
 
Improving Collection Understanding in Web Archives
Improving Collection Understanding in Web ArchivesImproving Collection Understanding in Web Archives
Improving Collection Understanding in Web ArchivesShawn Jones
 
Combining Social Media Storytelling With Web Archives
Combining Social Media Storytelling With Web ArchivesCombining Social Media Storytelling With Web Archives
Combining Social Media Storytelling With Web ArchivesShawn Jones
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web ArchivesMichael Nelson
 
Characteristics of Social Media Stories
Characteristics of Social Media StoriesCharacteristics of Social Media Stories
Characteristics of Social Media StoriesYasmin AlNoamany, PhD
 
Information Visualization - Visualizing Digital Collections at Archive-It
Information Visualization - Visualizing Digital Collections at Archive-ItInformation Visualization - Visualizing Digital Collections at Archive-It
Information Visualization - Visualizing Digital Collections at Archive-ItMichele Weigle
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesMichael Nelson
 
Telling Stories with Web Archives
Telling Stories with Web ArchivesTelling Stories with Web Archives
Telling Stories with Web ArchivesMichele Weigle
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple ArchivesMichael Nelson
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesMichael Nelson
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich  the Live Web Experience Through StorytellingUsing Web Archives to Enrich  the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingYasmin AlNoamany, PhD
 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...Justin Brunelle
 
The Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingThe Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingMichael Nelson
 
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...Alexander Nwala
 
Visualizing Digital Collections at Archive-It
Visualizing Digital Collections at Archive-ItVisualizing Digital Collections at Archive-It
Visualizing Digital Collections at Archive-ItMichele Weigle
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Mat Kelly
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...Yasmin AlNoamany, PhD
 
Linked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve MeyerLinked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve MeyerWiLS
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better Michael Nelson
 

What's hot (20)

Where Can We Post Stories Summarizing Web Archive Collections
Where Can We Post Stories Summarizing Web Archive CollectionsWhere Can We Post Stories Summarizing Web Archive Collections
Where Can We Post Stories Summarizing Web Archive Collections
 
Improving Collection Understanding in Web Archives
Improving Collection Understanding in Web ArchivesImproving Collection Understanding in Web Archives
Improving Collection Understanding in Web Archives
 
Combining Social Media Storytelling With Web Archives
Combining Social Media Storytelling With Web ArchivesCombining Social Media Storytelling With Web Archives
Combining Social Media Storytelling With Web Archives
 
Combining Storytelling and Web Archives
Combining Storytelling and Web ArchivesCombining Storytelling and Web Archives
Combining Storytelling and Web Archives
 
Characteristics of Social Media Stories
Characteristics of Social Media StoriesCharacteristics of Social Media Stories
Characteristics of Social Media Stories
 
Information Visualization - Visualizing Digital Collections at Archive-It
Information Visualization - Visualizing Digital Collections at Archive-ItInformation Visualization - Visualizing Digital Collections at Archive-It
Information Visualization - Visualizing Digital Collections at Archive-It
 
Storytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web ArchivesStorytelling for Summarizing Collections in Web Archives
Storytelling for Summarizing Collections in Web Archives
 
Telling Stories with Web Archives
Telling Stories with Web ArchivesTelling Stories with Web Archives
Telling Stories with Web Archives
 
Why We Need Multiple Archives
Why We Need Multiple ArchivesWhy We Need Multiple Archives
Why We Need Multiple Archives
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web Archives
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich  the Live Web Experience Through StorytellingUsing Web Archives to Enrich  the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through Storytelling
 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
 
The Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web ArchivingThe Memento Protocol and Research Issues With Web Archiving
The Memento Protocol and Research Issues With Web Archiving
 
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...
 
Visualizing Digital Collections at Archive-It
Visualizing Digital Collections at Archive-ItVisualizing Digital Collections at Archive-It
Visualizing Digital Collections at Archive-It
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...
Using Web Archives to Enrich the Live Web Experience Through Storytelling - P...
 
Linked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve MeyerLinked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve Meyer
 
Virtual Libraries
Virtual LibrariesVirtual Libraries
Virtual Libraries
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better
 

Similar to The Off-Topic Memento Toolkit

How Much of the Web is Archived? JCDL 2011
How Much of the Web is Archived? JCDL 2011How Much of the Web is Archived? JCDL 2011
How Much of the Web is Archived? JCDL 2011Ahmed AlSum
 
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis...






The Off-Topic Memento Toolkit

  • 1. The Off-Topic Memento Toolkit Shawn M. Jones Michele C. Weigle Michael L. Nelson Old Dominion University Web Science and Digital Libraries Research Group @WebSciDL sjone@cs.odu.edu @shawnmjones mweigle@cs.odu.edu @weiglemc mln@cs.odu.edu @phonedude_mln Thanks to:
  • 2. Many Curators Use Archive-It To Create Web Archive Collections: Archive-It makes it easy for curators to build collections and supply metadata for a collection.
  • 3-4. When Building A Web Archive Collection…: Curators select web resources as seeds. Each version of a seed becomes a memento. They create a web archive collection with a purpose in mind.
  • 5. When Researchers Prepare to Analyze a Web Archive Collection…: Some collections have thousands of seeds, and each seed has one or more mementos. The sheer number of mementos to process means that researchers need to quickly identify mementos with low information value. Off-topic mementos have low information value. We want to identify, not delete, these for further decision-making. We identify them so that they are not considered for selection as exemplars for storytelling. (Figures on slide: 81,014 seeds; 486,227 seed mementos.)
  • 6. How Can Mementos Go Off-Topic?
  • 7. Mementos in a Collection Can Go Off-Topic: For Technical Reasons. Examples: http://wayback.archive-it.org/1068/20130306212205/http://bo.amnesty.org/ http://wayback.archive-it.org/1068/20120303011104/http://bo.amnesty.org/
  • 8. Mementos in a Collection Can Go Off-Topic: Page Gone. Examples: http://wayback.archive-it.org/1068/20101221161732/http://www.acdauk.org.uk/ http://wayback.archive-it.org/1068/20110902210644/http://www.acdauk.org.uk/
  • 9. Mementos in a Collection Can Go Off-Topic: Content Drift – A Change in Languages. Examples: http://wayback.archive-it.org/1068/20130306231537/http://ecwronline.org/ http://wayback.archive-it.org/1068/20110129043404/http://ecwronline.org/
  • 10. Mementos in a Collection Can Go Off-Topic: Server Maintenance. Examples: http://wayback.archive-it.org/1068/20111202210620/http://amnestyghana.org/ http://wayback.archive-it.org/1068/20120302232416/http://amnestyghana.org/
  • 11. Mementos in a Collection Can Go Off-Topic: Account Suspension. Examples: http://wayback.archive-it.org/1068/20110317151735/http://amnestymauritius.org/french/news.php http://wayback.archive-it.org/1068/20111202210625/http://amnestymauritius.org/cgi-sys/suspendedpage.cgi
  • 12. Mementos in a Collection Can Go Off-Topic: Site Redesign. Examples: http://wayback.archive-it.org/1068/20120302224302/http://ombuds.am/main/ http://wayback.archive-it.org/1068/20100510173253/http://ombuds.am/main
  • 13. Mementos in a Collection Can Go Off-Topic: Change in Site Ownership. Examples: http://wayback.archive-it.org/1068/20090210190543/http://www.afapredesa.org/index.php http://wayback.archive-it.org/1068/20120302210439/http://www.afapredesa.org/
  • 14. Mementos in a Collection Can Go Off-Topic: The Site Was Hacked. Examples: http://wayback.archive-it.org/2950/20120327032244/http://occupyevansville.org/ http://wayback.archive-it.org/2950/20120410032628/http://occupyevansville.org/
  • 15. Mementos in a Collection Can Go Off-Topic: The Site Moves On From The Topic. Examples: http://wayback.archive-it.org/2358/20120803140009/http://www.bbc.co.uk/news/world/middle_east/ http://wayback.archive-it.org/2358/20110202225040/http://www.bbc.co.uk/news/world/middle_east/
  • 16. Presenting the Off-Topic Memento Toolkit (OTMT), a tool for identifying these off-topic mementos.
  • 17. The Off-Topic Memento Toolkit (OTMT): Currently in alpha status, the OTMT accepts a collection of mementos, executes similarity measures on those mementos, rates them as on-topic or off-topic, and identifies, but does not delete, off-topic mementos. https://github.com/oduwsdl/off-topic-memento-toolkit
  • 19-21. Related Work – Similarity Measures for Documents: Measures shown: Simhash; Sørensen-Dice Coefficient; Jaccard Index; Cosine Similarity of TF-IDF Vectors; Cosine Similarity of Latent Semantic Indexing Vectors. OTMT supports these similarity measures. Citations shown: Manku (2007), Sørensen (1948), Dice (1945), Jaccard (1912), Charikar (2002), Hajishirzi (2010), Deerwester (1990). Adar (2009), Sivakumar (2015), Řehůřek (2011): like these studies, we also use these similarity measures on mementos. Content Drift in Web Archives: Jones (2016), Zittrain (2014).
  • 22-25. Related Work – Other Methods of Off-Topic Detection: Off-Topic Analysis: AlNoamany (2016). Latent Dirichlet Allocation: Blei (2003). Browser Thumbnails of Mementos: AlSum (2012). Topic modeling should help us find off-topic documents, but which cluster is off-topic? It is costly to manually review browser thumbnails to find off-topic mementos. We build on AlNoamany’s work to bring you the Off-Topic Memento Toolkit.
  • 26. Memento Protocol Terminology: Each seed, or original resource, has a corresponding TimeMap listing all of that seed’s mementos and their capture times, the memento-datetimes. URI-T: a URI for a TimeMap. URI-M: a URI for a memento. Example TimeMap:
      <http://a.example.org>;rel="original",
      <http://arxiv.example.net/timemap/http://a.example.org>; rel="self"; type="application/link-format" ; from="Tue, 20 Jun 2000 18:02:59 GMT" ; until="Wed, 21 Jun 2000 04:41:56 GMT",
      <http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
      <http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT",
      <http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento"; datetime="Tue, 27 Oct 2009 20:49:54 GMT",
      <http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento"; datetime="Wed, 21 Jun 2000 01:17:31 GMT",
      <http://arxiv.example.net/web/20000621044156/http://a.example.org>; rel="memento"; datetime="Wed, 21 Jun 2000 04:41:56 GMT" …
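
To make the TimeMap structure concrete, here is a minimal sketch of reading a link-format TimeMap like the one on the slide and pulling out the URI-Ms with their memento-datetimes. It is an illustration only, not the OTMT's own parser; the names `parse_timemap`, `memento_uris`, and `SAMPLE` are hypothetical, and the regular expressions assume every attribute value is double-quoted, as in the slide's example.

```python
import re
from typing import Dict, List

# One link-format entry: <URI> followed by zero or more ;key="value" attributes.
# Assumption: attribute values are always double-quoted, as in the slide example.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_timemap(text: str) -> List[Dict[str, str]]:
    """Return one dict per link entry, with the URI stored under 'uri'."""
    entries = []
    for uri, attrs in LINK_RE.findall(text):
        entry = {"uri": uri}
        entry.update(dict(ATTR_RE.findall(attrs)))
        entries.append(entry)
    return entries


def memento_uris(entries: List[Dict[str, str]]) -> List[str]:
    """Keep URI-Ms: entries whose rel contains the token 'memento'."""
    return [e["uri"] for e in entries if "memento" in e.get("rel", "").split()]


# Shortened copy of the slide's example TimeMap.
SAMPLE = (
    '<http://a.example.org>;rel="original", '
    '<http://arxiv.example.net/timemap/http://a.example.org>; rel="self"; '
    'type="application/link-format", '
    '<http://arxiv.example.net/web/20000620180259/http://a.example.org>; '
    'rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT", '
    '<http://arxiv.example.net/web/20000621011731/http://a.example.org>; '
    'rel="memento"; datetime="Wed, 21 Jun 2000 01:17:31 GMT"'
)

if __name__ == "__main__":
    for uri_m in memento_uris(parse_timemap(SAMPLE)):
        print(uri_m)
```

Splitting `rel` on whitespace lets values such as "first memento" and "last memento" count as mementos while "original", "self", and "timegate" are skipped.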
  • 27. Web Archives Augment Their Mementos: Banners; Rewritten Links. http://ws-dl.blogspot.com/2016/04/2016-04-27-mementos-in-raw.html http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html
  • 28. The OTMT Uses Raw Mementos: Raw mementos are free of these augmentations. Archive-It and the Internet Archive provide access to raw mementos at special URIs. The OTMT finds these raw mementos and uses them in its similarity comparisons. http://ws-dl.blogspot.com/2016/04/2016-04-27-mementos-in-raw.html http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html
  • 29. The OTMT Performs Preprocessing:
      Input: <p class="homepage-description">The Women’s Initiatives for Gender Justice works globally to ensure justice for women and an independent and effective International Criminal Court.</p>
      Boilerplate removal: The Women’s Initiatives for Gender Justice works globally to ensure justice for women and an independent and effective International Criminal Court.
      Tokenization: ['The', 'Women', '’', 's', 'Initiatives', 'for', 'Gender', 'Justice', 'works', 'globally', 'to', 'ensure', 'justice', 'for', 'women', 'and', 'an', 'independent', 'and', 'effective', 'International', 'Criminal', 'Court', '.']
      Remove stop words: ['Women', '’', 'Initiatives', 'Gender', 'Justice', 'works', 'globally', 'ensure', 'justice', 'women', 'independent', 'effective', 'International', 'Criminal', 'Court']
      Stemming: ['women', '’', 'initi', 'gender', 'justic', 'work', 'global', 'ensur', 'justic', 'women', 'independ', 'effect', 'intern', 'crimin', 'court']
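
The tokenization and stop-word steps can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the OTMT's implementation: `STOP_WORDS` is a tiny hypothetical list chosen to cover the slide's sentence, ASCII punctuation tokens are dropped along with stop words (which matches the slide's output, where '.' disappears but '’' survives), and stemming is left as a pluggable function. The default `str.lower` only lowercases; a Porter-style stemmer would be needed to get stems such as 'initi' and 'justic'.

```python
import re
import string
from typing import Callable, List

# Hypothetical, deliberately tiny stop list; a real run would use a full list.
STOP_WORDS = frozenset({"the", "s", "for", "to", "and", "an"})

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text)


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop stop words (case-insensitive) and ASCII punctuation tokens."""
    return [
        t for t in tokens
        if t.lower() not in STOP_WORDS and t not in string.punctuation
    ]


def preprocess(text: str, stem: Callable[[str], str] = str.lower) -> List[str]:
    """Tokenize, remove stop words, then apply the supplied stemmer."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


SENTENCE = (
    "The Women’s Initiatives for Gender Justice works globally to ensure "
    "justice for women and an independent and effective International "
    "Criminal Court."
)

if __name__ == "__main__":
    print(tokenize(SENTENCE))
    print(remove_stop_words(tokenize(SENTENCE)))
    print(preprocess(SENTENCE))
```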
  • 30. We Evaluated the OTMT with a Gold Standard Dataset: In “Detecting off-topic pages within TimeMaps in Web archives”, AlNoamany performed a study to detect off-topic mementos. The mementos were manually marked as on-topic or off-topic. We reuse this dataset in our evaluation. https://github.com/oduwsdl/offtopic-goldstandard-data
  • 32. General algorithm: For each TimeMap in a collection:
      1. Get the first memento
      2. Preprocess it
      3. For each memento in the TimeMap (the considered memento):
         1. Get the memento
         2. Preprocess it
         3. Compute the similarity to the first memento using a given measure
         4. Save the score
         5. A threshold value determines if a memento is on-topic or off-topic
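
The loop above can be sketched as a small function that takes the preprocessing step, the measure, and the threshold as parameters. This is a sketch, not the OTMT's code: `rate_timemap`, `toy_distance`, and the `off_topic_when` comparison are hypothetical names, and memento content is passed in as strings rather than fetched over HTTP. The comparison is a parameter because a distance flags a memento when the score rises above the threshold, while a similarity flags it when the score falls below.

```python
import operator
from typing import Callable, Dict, List, Sequence


def rate_timemap(
    mementos: Sequence[str],
    preprocess: Callable[[str], List[str]],
    measure: Callable[[List[str], List[str]], float],
    threshold: float,
    off_topic_when: Callable[[float, float], bool] = operator.gt,
) -> List[Dict[str, object]]:
    """Score every memento against the first memento of its TimeMap."""
    if not mementos:
        return []
    first = preprocess(mementos[0])
    results = []
    for content in mementos:
        score = measure(first, preprocess(content))
        results.append(
            {"score": score, "off_topic": off_topic_when(score, threshold)}
        )
    return results


def toy_distance(a: List[str], b: List[str]) -> float:
    """Illustrative measure: 1 - |intersection| / |union| of the token sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


if __name__ == "__main__":
    timemap = ["human rights news", "human rights report", "domain for sale"]
    for row in rate_timemap(timemap, str.split, toy_distance, threshold=0.8):
        print(row)
```

For a similarity measure such as cosine, the same function would be called with `off_topic_when=operator.lt`.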
  • 33. Structural Measures – Byte Count and Word Count: On-topic example: 9599 bytes, 183 words (after preprocessing). Off-topic example: 401 bytes, 22 words (after preprocessing). Off-topic mementos tend to have fewer bytes/words. Scores range from 0 to -1.
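
One reading that is consistent with scores between 0 and -1 (and with a threshold such as `bytecount=-0.50`) is the relative change in size compared with the first memento. The slide does not spell out the formula, so the sketch below is an assumption, and `relative_change` is a hypothetical name.

```python
def relative_change(first_count: int, current_count: int) -> float:
    """Relative size change versus the first memento.

    Assumed formula: 0 means no change, -1 means the memento shrank to
    nothing. A memento larger than the first yields a positive value.
    """
    if first_count <= 0:
        raise ValueError("first memento must have a positive count")
    return (current_count - first_count) / first_count


if __name__ == "__main__":
    # Counts taken from the slide's on-topic and off-topic examples.
    print(relative_change(9599, 401))  # byte count
    print(relative_change(183, 22))    # word count
```

With the slide's numbers, both the byte-count and word-count scores fall well below -0.5, so a threshold of -0.50 would flag the second memento under this reading.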
  • 34. Set Operation Measures: Jaccard Distance: based on the size of the intersection over the size of the union (the distance is one minus this ratio). Sørensen-Dice Distance: based on twice the size of the intersection over the combined size of both sets (the distance is one minus this ratio). Scores range from 0 to 1.
      Words from Doc #1: ['women', '’', 'initi', 'gender', 'justic', 'current', 'work', 'uganda', 'democrat', 'republ', 'congo', 'libya']
      Words from Doc #2: ['women', '’', 'initi', 'gender', 'justic', 'work', 'uganda', 'democrat', 'republ', 'congo', 'sudan', 'central', 'african', 'republ', 'kenya', 'libya', 'kyrgyzstan']
      (On the slide, highlighted words are the intersection.)
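
Both set measures can be worked through on the slide's two word lists. The sketch below uses the standard definitions (distance = 1 minus the coefficient); whether the OTMT computes them in exactly this way is an assumption, and the function names are hypothetical.

```python
from typing import Iterable


def jaccard_distance(a: Iterable[str], b: Iterable[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0 means identical sets, 1 means disjoint."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def sorensen_dice_distance(a: Iterable[str], b: Iterable[str]) -> float:
    """1 - 2|A ∩ B| / (|A| + |B|); 0 means identical sets, 1 means disjoint."""
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    if total == 0:
        return 0.0
    return 1.0 - 2 * len(sa & sb) / total


DOC1 = ["women", "’", "initi", "gender", "justic", "current", "work",
        "uganda", "democrat", "republ", "congo", "libya"]
DOC2 = ["women", "’", "initi", "gender", "justic", "work", "uganda",
        "democrat", "republ", "congo", "sudan", "central", "african",
        "republ", "kenya", "libya", "kyrgyzstan"]

if __name__ == "__main__":
    print(jaccard_distance(DOC1, DOC2))
    print(sorensen_dice_distance(DOC1, DOC2))
```

For these lists the sets have 12 and 16 distinct terms with 11 in common, so the Jaccard distance is 1 - 11/17 = 6/17 and the Sørensen-Dice distance is 1 - 22/28 = 3/14.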
  • 35. Simhash of Term Frequencies:
      Terms and frequencies from Document #1: ('women', 4), ('justic', 4), ('’', 3), ('gender', 3), ('initi', 2), ('intern', 2), ('icc', 2), ('work', 2), ('republ', 2), ('human', 1), …
      Simhash of terms and frequencies from Document #1: 13221438115839111206
      Terms and frequencies from Document #2: ('women', 4), ('justic', 4), ('’', 3), ('gender', 3), ('initi', 2), ('intern', 2), ('icc', 2), ('work', 2), ('human', 1), ('right', 1), …
      Simhash of terms and frequencies from Document #2: 13797903006343525414
      Simhash Distance: 6 bits. Scores range from 0 to 64 bits.
  • 36. @shawnmjones @WebSciDL Simhash of raw content 36 The Women’s Initiatives for Gender Justice is an international women’s human rights organisation that advocates for gender justice through the International Criminal Court (ICC) and through domestic mechanisms, including peace negotiations and justice processes.We work with women and communities most affected by the armed conflict with a focus on countries with situations under investigation by the ICC. The Women’s Initiatives for Gender Justice currently works in Uganda, the Democratic Republic of the Congo and Libya. The Women’s Initiatives for Gender Justice is an international women’s human rights organisation that advocates for gender justice through the International Criminal Court (ICC) and through domestic mechanisms, including peace negotiations and justice processes. We work with women most affected by the conflict situations under investigation by the ICC. The Women’s Initiatives for Gender Justice works in Uganda, the Democratic Republic of the Congo, Sudan, the Central African Republic, Kenya, Libya and Kyrgyzstan. 12358429319379250844 12359555184926328508 6 bits Scores range from 0 to 64 bits Simhash of Document #1: Simhash of Document #2: Simhash Distance:
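Both Simhash slides report a distance in bits between two 64-bit fingerprints. That distance is a Hamming distance, which can be computed by XOR-ing the two integers and counting the set bits. The sketch below shows only the comparison step; producing the fingerprints is left to a Simhash library, and `simhash_distance` is an illustrative name.

```python
def simhash_distance(fingerprint_a: int, fingerprint_b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(fingerprint_a ^ fingerprint_b).count("1")


# Fingerprint pairs as printed on the two Simhash slides.
tf_pair = (13221438115839111206, 13797903006343525414)
raw_pair = (12358429319379250844, 12359555184926328508)

print(simhash_distance(*tf_pair))
print(simhash_distance(*raw_pair))
```

With 64-bit fingerprints the result always falls between 0 (identical) and 64 (every bit differs), matching the score range stated on the slides.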
  • 37. @shawnmjones @WebSciDL Cosine Similarities. Take the cosine of the document vectors. Cosine of TF-IDF: vectors are formed from each document and their term frequencies. Cosine of Latent Semantic Indexing (LSI): each vector is informed by LSI. Scores range from 1 to 0.
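As a minimal sketch of "take the cosine of the document vectors", the code below builds plain term-count vectors and computes their cosine. It deliberately omits the IDF weighting and the LSI projection named on the slide, so it shows only the final comparison step; `cosine_similarity` here is a hand-written helper, not a library call.

```python
import math
from collections import Counter
from typing import Iterable


def cosine_similarity(tokens_a: Iterable[str], tokens_b: Iterable[str]) -> float:
    """Cosine of two term-count vectors: 1.0 for same direction, 0.0 for no shared terms."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


same = cosine_similarity(["women", "justic", "gender"], ["women", "justic", "gender"])
partial = cosine_similarity(["women", "justic"], ["women", "sudan"])
disjoint = cosine_similarity(["women"], ["sudan"])
print(same, partial, disjoint)
```

Unlike the distance measures, a higher cosine score means more similar, which is why the slide lists the range as running from 1 down to 0.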
  • 39. @shawnmjones @WebSciDL OTMT Installation Options 1. pip from PyPI (preferred): pip install otmt 2. Experimental Docker image: docker pull shawnmjones/otmt 3. Source code: git clone https://github.com/oduwsdl/off-topic-memento-toolkit.git
  • 40. @shawnmjones @WebSciDL OTMT Usage
      # detect_off_topic -i archiveit=7877 -tm jaccard=0.80,bytecount=-0.50 -o outputfile.json
    In this command, -i gives the input, -tm gives the measures, and -o gives the output.
    Input types for -i: timemap (followed by 1 or more TimeMap URIs, separated by commas); warc (followed by 1 or more WARC files, separated by commas); archiveit (followed by an Archive-It collection ID).
    TimeMap measures for -tm: bytecount, wordcount, jaccard, sorensen, simhash-tf, simhash-raw, cosine, gensim_lsi.
    Output types for -ot: json, csv.
  • 41. @shawnmjones @WebSciDL OTMT Output - JSON
      "http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
        "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
          "timemap measures": {
            "cosine": { "stemmed": true, "tokenized": true, "removed boilerplate": true, "comparison score": 0.10969941307631487, "topic status": "off-topic" },
            "bytecount": { "stemmed": false, "tokenized": false, "removed boilerplate": false, "comparison score": 0.15971409055425445, "topic status": "on-topic" }
          },
          "overall topic status": "off-topic"
        }, ...
    The outer key is the URI-T of the TimeMap and the inner key is the URI-M of the memento. Each entry under "timemap measures" holds the measure information: preprocessing status, measure score, and on-topic or off-topic status by measure. "overall topic status" holds the on-topic or off-topic status overall.
  • 42. @shawnmjones @WebSciDL OTMT Output - JSON (continued): If one measure scores a memento as off-topic, the memento is considered off-topic. In the badil.org example, cosine reports "off-topic" and bytecount reports "on-topic", so the "overall topic status" is "off-topic".
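A short Python sketch shows one way to pull the off-topic mementos out of a JSON report shaped like the excerpt on these slides. It assumes the report contains only the keys visible in the excerpt (the slide elides the rest with "..."), and `off_topic_mementos` is a hypothetical helper, not part of the OTMT.

```python
import json

# Trimmed copy of the slide's example, closed off so that it parses as JSON.
sample_report = json.loads("""
{
  "http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
    "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
      "timemap measures": {
        "cosine": {"comparison score": 0.10969941307631487, "topic status": "off-topic"},
        "bytecount": {"comparison score": 0.15971409055425445, "topic status": "on-topic"}
      },
      "overall topic status": "off-topic"
    }
  }
}
""")


def off_topic_mementos(report: dict) -> list:
    """Return (URI-M, measures that flagged it) for every off-topic memento."""
    found = []
    for urit, mementos in report.items():
        for urim, record in mementos.items():
            if record.get("overall topic status") != "off-topic":
                continue
            flagged = [
                name
                for name, details in record.get("timemap measures", {}).items()
                if details.get("topic status") == "off-topic"
            ]
            found.append((urim, flagged))
    return found


flagged_mementos = off_topic_mementos(sample_report)
print(flagged_mementos)
```

For a real run, the same function could be applied to the file named with -o after loading it with `json.load`.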
  • 43. @shawnmjones @WebSciDL Supported Similarity Measures
      Measure | Fully Equivalent Score | Fully Dissimilar Score | Preprocessing Performed | OTMT -tm keyword
      Byte Count | 0.0 | -1.0 | No | bytecount
      Word Count | 0.0 | -1.0 | Yes | wordcount
      Jaccard Distance | 0.0 | 1.0 | Yes | jaccard
      Sørensen-Dice | 0.0 | 1.0 | Yes | sorensen
      Simhash of Term Frequencies | 0 | 64 | Yes | simhash-tf
      Simhash of raw memento | 0 | 64 | No | simhash-raw
      Cosine Similarity of TF-IDF Vectors | 1.0 | 0 | Yes | cosine
      Cosine Similarity of LSI Vectors | 1.0 | 0 | Yes | gensim_lsi
  • 45. @shawnmjones @WebSciDL Experiment setup  For each measure: 1. Start the threshold at the score of complete dissimilarity 2. Test with the URI-Ms from the gold standard data set as if that threshold indicated off-topic 3. Compute F1 using the real off-topic status of the memento from the gold standard data 4. Increment the threshold 5. Repeat 2 – 4 until the threshold matches the complete equivalence score  Example using Byte Count: 1. Start the threshold at -1 2. Test with the URI-Ms from the gold standard data set as if -1 indicated off-topic 3. Compute F1 using the real off-topic status of the memento from the gold standard data 4. Increment the threshold to -0.99 5. Test with the URI-Ms from the gold standard data set as if -0.99 indicated off-topic 6. Compute F1 with the real status 7. Increment to -0.98 8. Repeat until the threshold is 0
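The threshold sweep described on this slide can be sketched as follows. The scores and labels are made-up toy data, and two details are assumptions rather than statements from the slides: the step of 0.01 is taken from the byte count example, and a memento is treated as off-topic when its score is at or below the threshold.

```python
from typing import Callable, List, Tuple


def f1_score(predicted: List[bool], actual: List[bool]) -> float:
    """F1 for the off-topic class; 0.0 when there are no true positives."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep_thresholds(
    scores: List[float],
    actual_off_topic: List[bool],
    start: float,
    stop: float,
    step: float,
    off_topic_if: Callable[[float, float], bool],
) -> Tuple[float, float]:
    """Step the threshold from start to stop and keep the first best (threshold, F1)."""
    best = (start, -1.0)
    for i in range(round((stop - start) / step) + 1):
        threshold = round(start + i * step, 10)  # avoid floating-point drift
        predicted = [off_topic_if(score, threshold) for score in scores]
        f1 = f1_score(predicted, actual_off_topic)
        if f1 > best[1]:
            best = (threshold, f1)
    return best


# Toy byte count scores with hypothetical gold-standard labels.
toy_scores = [-0.96, -0.90, -0.10, 0.05, -0.40]
toy_labels = [True, True, False, False, False]
best_threshold, best_f1 = sweep_thresholds(
    toy_scores, toy_labels, start=-1.0, stop=0.0, step=0.01,
    off_topic_if=lambda score, threshold: score <= threshold,
)
print(best_threshold, best_f1)
```

For measures where a higher score means less similar, such as Jaccard distance, the comparison passed as `off_topic_if` would presumably be reversed.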
  • 46. @shawnmjones @WebSciDL Our results do not match AlNoamany’s, but the world is not the same as it was in 2015…
      Aspect | AlNoamany’s Study | Our Study
      Year Conducted | 2015 | 2017
      Boilerplate Removal | Boilerpipe (Java) | jusText
      Tokenization and Stemming | Scikit-learn | NLTK
    Other changes: download errors; Gold Standard Dataset updates.
  • 47. @shawnmjones @WebSciDL Simhash of Term Frequencies. Our results: [plot not reproduced]. AlNoamany’s results: not tested.
  • 48. @shawnmjones @WebSciDL Simhash of raw memento. Our results: [plot not reproduced]. AlNoamany’s results: not tested.
  • 49. @shawnmjones @WebSciDL Sørensen-Dice Distance Results. Our results: [plot not reproduced]. AlNoamany’s results: not tested.
  • 50. @shawnmjones @WebSciDL Jaccard Distance Results. Our results: [plot not reproduced]. AlNoamany’s results: best F1 score 0.538 at threshold 0.95.
  • 51. @shawnmjones @WebSciDL Cosine Similarity of LSI Vectors. Our results: [plot not reproduced]. AlNoamany’s results: not tested. Note: LSI scores are non-deterministic.
  • 52. @shawnmjones @WebSciDL Byte Count Results. Our results: [plot not reproduced]. AlNoamany’s results: best F1 score 0.584 at threshold -0.65.
  • 53. @shawnmjones @WebSciDL Cosine Similarity of TF-IDF Vectors (best score in AlNoamany’s results). Our results: [plot not reproduced]. AlNoamany’s results: best F1 score 0.881 at threshold 0.15.
  • 54. @shawnmjones @WebSciDL Word Count Results (best score in our results). Our results: [plot not reproduced]. AlNoamany’s results: best F1 score 0.806 at threshold -0.85.
  • 55. @shawnmjones @WebSciDL Results Summarized – Best F1 Score is Word Count
      Similarity Measure | AlNoamany: Best F1 Score | AlNoamany: Corresponding Accuracy | AlNoamany: Corresponding Threshold | This study: Best F1 Score | This study: Corresponding Accuracy | This study: Corresponding Threshold
      Word Count | 0.806 | 0.982 | -0.85 | 0.788 | 0.971 | -0.7
      Cosine Similarity of TF-IDF Vectors | 0.881 | 0.983 | 0.15 | 0.766 | 0.965 | 0.12
      Byte Count | 0.584 | 0.962 | -0.65 | 0.756 | 0.965 | -0.39
      Cosine Similarity of LSI Vectors | Not tested | Not tested | Not tested | 0.711 | 0.965 | 0.12 with 10 topics
      Jaccard Distance | 0.538 | 0.962 | 0.95 | 0.651 | 0.953 | 0.94
      Sørensen-Dice Distance | Not tested | Not tested | Not tested | 0.649 | 0.953 | 0.88
      Simhash on raw memento content | Not tested | Not tested | Not tested | 0.578 | 0.934 | 25
      Simhash on TF | Not tested | Not tested | Not tested | 0.523 | 0.942 | 28
    In our study, word count produced the best F1 score (0.788). In AlNoamany’s study, cosine similarity of TF-IDF vectors produced the best F1 score (0.881).
  • 56. @shawnmjones @WebSciDL What about using measures together? AlNoamany found that using cosine similarity of TF-IDF vectors and word count together produced even better results. Our best F1 score for word count alone was 0.788. Word count combined with LSI turned out to be slightly better, with the same accuracy. The success of word count appears to influence the threshold of its partner measure, making that threshold stricter.
  • 58. @shawnmjones @WebSciDL Improving the OTMT  Bug fixes  Make LSI scores reproducible  New Measures  TimeMap Measures – compare first memento with considered memento:  Spamsum of the raw content – used by Andy Jackson at the UKWA  Cosine of LDA Vectors via Gensim  Collection Measures 1. Develop a collection-wide picture 2. Compare each memento against that picture  Control over preprocessing:  Options to use a different boilerplate removal method  Options to turn off stemming or stop word removal
  • 60. @shawnmjones @WebSciDL Motivation - Mementos Can Go Off-Topic. Collections have a theme. Seeds are selected to support that theme. Mementos are versions of seeds. Some of these versions are off-topic. Identifying these off-topic mementos is key to some research activities, like summarization. Examples pictured: hacked, moved on from topic, web page gone, account suspension.
  • 61. @shawnmjones @WebSciDL OTMT supports different similarity measures with thresholds established based on experimentation  Byte count  Word count  Jaccard distance  Sørensen-Dice distance  Simhash of term frequencies  Simhash of raw memento content  Cosine similarity of TF-IDF vectors  Cosine similarity of LSI vectors
  • 62. @shawnmjones @WebSciDL Please try out the Off-Topic Memento Toolkit! 1. pip (preferred): pip install otmt 2. Experimental Docker image: docker pull shawnmjones/otmt 3. Source code: git clone https://github.com/oduwsdl/off-topic-memento-toolkit.git Project: https://github.com/oduwsdl/off-topic-memento-toolkit Gold standard data: https://github.com/oduwsdl/offtopic-goldstandard-data