Improving Collection Understanding For Web Archives With Storytelling: Shining Light Into Dark and Stormy Archives - PhD Defense

Shawn M. Jones
Los Alamos National Laboratory
Research Library Prototyping Team
Web Science and Digital Libraries Research Group
Old Dominion University
Dissertation Defense 2021/08/05

@shawnmjones @WebSciDL
Outline
1. Motivation And Research Questions
2. Background And Related Work
3. Selecting Exemplars And Generating Story Metadata
4. Generating Document Metadata
5. Visualizing And Distributing Stories
6. Contributions And Conclusion

@shawnmjones @StormyArchives
November 8, 2019
During the second week of November 2019,
the National Center for Medical Intelligence shared
intelligence based on "monitoring of internal Chinese
communications" that warned of a potential novel
coronavirus pandemic coming out of Wuhan.
Source: https://en.wikipedia.org/wiki/Timeline_of_the_COVID-19_pandemic_in_2019
COVID-19 was not named and was only
known to a small group in the US.
No news coverage existed.

December 16, 2019
The first documented COVID-19 hospital
admission was on December 16, 2019.
COVID-19 was still not well known and
received no news coverage.

January 13, 2020
One month later, CNN carries a coronavirus
category on its front page.

February 28, 2020
Another month goes by with more front-page
articles about coronavirus.

March 13, 2020
Two weeks later, CNN had many front-page
articles about coronavirus, with a special
Coronavirus heading for more articles.

March 20, 2020
A week later, states are locking down.

March 27, 2020
A week later, the US has the most cases of any
country.

A web archive
helped me tell
this story.
These mementos are stored
in the Internet Archive.
They are full captures of the
web code that existed on
those dates.

What other stories can we tell with web
archives?
Motivation and Research Questions

Natasha is studying how disasters shape
cultures...
Sources like Wikipedia now have a
summary of the event after the
fact.
Today she is
reviewing the
South Louisiana
Flood of 2016.
She wants to know about
the news reporting as it was
at the time of the event.

Per Nwala et al., news articles about the event tend to slide
down search results as we get further from the event.
Green = coverage of event
Red = Summaries of the event
A. C. Nwala, M. C. Weigle, and M. L. Nelson, “Scraping SERPs for Archival Seeds: It Matters
When You Start,” in ACM/IEEE JCDL, 2018. https://doi.org/10.1145/3197026.3197056.
She knows that
five years later,
it is harder to
find news
articles from
the event itself.

Natasha also knows that news articles are updated with
more current and correct information
She wants to
know about
the news
reporting as it
was at the time
of the event.
Today
8/14/2016
during event

Natasha knows that any time that we need proof
that X said Y at date D, we need web archives
She knows that
web archives
contain not just
“screenshots”
but full
captures of
web code as
mementos.
To start, she must know a
URL and capture
datetime.
Then she can view a
memento.
And she can review its
code, if needed.

Natasha also knows that archivists create
web archive collections based on a theme

With these themed collections, she can discover documents
that once existed and match her event or topic
Virginia Tech: Crisis, Tragedy, and
Recovery Network capturing
coverage of the 2011 Tucson Shootings
University of Utah capturing its
web presence over time

Natasha has discovered multiple sites with
themed web archive collections
Library of Congress
Archive-It
(by the Internet Archive)
Trove
Conifer
Each site has
different
capabilities and
different types of
collections.

Natasha chooses to look through the
themed collections at Archive-It
As a popular subscription service of the Internet
Archive, Archive-It helps archivists create themed
collections.
These collections consist of seeds.
Mementos are observations of a seed at different
points in time.
For each seed, there are multiple mementos.
This seed has 7 mementos (captured 7 times).

There are multiple collections about the
subject. Which one should she work with?
This is not the only
disaster she is studying.
She needs to waste as
little time as possible.

Natasha is not alone: 44
Archive-It collections
match the search query
“human rights”
How are they different
from each other?
Which one is best for
her needs?

Rustam needs to study how the Boston
Marathon Bombing unfolded…
Reviewing different
mementos of the
same seed allows
Rustam to
understand when
the public learned of
different events,
including when
misinformation was
corrected.
Rather than digging through collections manually, how can Rustam discover and view this more quickly?

Olayinka wants to understand what different
news sources revealed on the same day…
Today she is
trying to
understand the
different
reporting on the
September 11th
Attacks.
How can Olayinka discover and view this more quickly?

Elbert is an archivist who wants to promote his
collections, so others are aware of them…
He wants to help
visitors like
Natasha, Rustam,
and Olayinka
notice his
collections and use
them.
How does he create enticing visualizations that people can understand with minimal effort?

Ling is an archivist who inherited a collection from another
archivist, and she needs to understand it so she can make
decisions about it…
Her collection has
hundreds of
thousands of seeds.
Her predecessor did
not provide much
metadata with the
collection.
Archivists can add metadata to
collections, but many Archive-It
collections contain little metadata.
The more metadata a reader needs to
understand a collection, the less they
have available.

Ling knows she is not alone – the collections are often built
automatically, making it difficult to know what they contain
Web Archiving Technical Lead of the British Library
Ling knows that the
automation makes it
expensive to add
metadata to
thousands of
documents after
they are collected.

All these personas need a faster method of
collection understanding
Persona  | Role      | Information need                                      | Understanding needs
Natasha  | Visitor   | Quickly compare collections                           | Overall collection
Rustam   | Visitor   | Follow a source over time                             | Aspect (Page) of a collection
Olayinka | Visitor   | Understand a time from different sources              | Aspect (Time) of a collection
Elbert   | Archivist | Promote collections and help visitors understand them | Overall collection
Ling     | Archivist | Understand a collection that they inherited           | Overall collection

All are faced with more than 14,000
collections at Archive-It alone
More than 14,000 collections exist as of the end of 2020
[Figure: # of New Archive-It Collections Per Year, 2005 to 2020. x-axis: Year; y-axis: # of Collections (0 to 2500); series: All Collections, Only Private Collections, Only Public Collections.]

The problem, summarized
• There are multiple collections about the same concept.
• It is difficult to easily expose aspects (e.g., time, page) of collections.
• The metadata for each collection is non-existent or inconsistently applied.
• Many collections have 1000s of seeds with multiple mementos.
• There are more than 14,000 collections.
• Human review of these mementos for collection understanding is an expensive proposition.

Our proposal: a visualization made of
exemplar mementos
• Our visualization is a summary that will act like an abstract
• Pirolli and Card’s Information Foraging Theory:
  • maximize the value of the information gained from our summaries
  • minimize the cost of interacting with the collection
  • ensure that our exemplar mementos have good information scent
  • contain cues that the memento will address a user’s needs
From this: 318 seeds with 2421 mementos
To something like this: a social media story of ~28 surrogates
P. Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive
Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20

Users already interact with pages like this
every day
A story on Wakelet about the 2021
Capitol Attack
A Twitter Moment of
astronaut Michael Collins
Twitter creates
Moments that
present surrogates
linking to content
about a topic of
interest.
Educators, librarians,
and others create
stories on Wakelet
about different
subjects.

Social media stories apply visualizations that
users already know how to understand
An individual surrogate summarizes a web resource.
When we combine surrogates into a story, we
summarize a topic.

We developed a five-process storytelling model based
on existing work on summarization and storytelling
[Figure: the five processes of the storytelling model: Select Exemplars, Generate Story Metadata, Generate Document Metadata, Visualize The Story, Distribute The Story. Example outputs shown: exemplar mementos; collection title: 2013 Boston Marathon Bombing; collected by: Internet Archive Global Events; collection URL; image data; seed data; top terms; top entities; title: Boston Marathon Explosions...; description: “The grace this tragedy exposed...”; striking image.]
AlNoamany found
that popular stories
contain 28 elements,
so we have a target
of 28 exemplars.
AlNoamany
pioneered this work
combining web
archive collections
with Storify, but Storify
is now gone.

Our five-process storytelling model maps to
our research questions
RQ1: What types of web archive
collections exist and what are
their structural features?
RQ2: What approaches work
best for selecting exemplars
from web archive collections?
RQ3: What surrogates work best
for understanding groups of
mementos?
RQ4: What methods that
automate the creation of
surrogates produce results that
best match humans’ behavior?
[Figure: the research questions above mapped onto the five processes: Generate Story Metadata, Select Exemplars, Generate Document Metadata, Visualize The Story, Distribute The Story.]
Examples and Use Cases for our
Personas

Our Dark and Stormy Archives Tools serve as a
reference implementation of our storytelling process

Background and Related Work

URIs identify resources
URIs are a superset of identifiers that
contains URLs, URNs, etc.
URIs identify resources, which have
different representations
depending on the visitor’s needs.
T. Berners-Lee, et al., “RFC 3986 – Uniform Resource Identifier (URI): Generic Syntax,”
https://www.rfc-editor.org/rfc/rfc3986.txt, 2005.
I. Jacobs and N. Walsh, eds., “Architecture of the World Wide Web, Vol. 1,”
https://www.w3.org/TR/webarch/, 2003.

HTML is the file format we use for web
resources
HTML contains links to other
pages, identified by URIs.

Web archives apply crawlers to quickly visit
pages and follow links to build collections
Crawlers save web resources in the
WARC file format.
WARC/1.0
WARC-Date: 2016-04-01T18:08:53Z
WARC-Type: response
WARC-Record-ID: <urn:uuid:8ede0256-790b-4378-a469-cfdcf0b9f8a6>
WARC-Target-URI: http://philadelphia.feb.gov/
WARC-Payload-Digest: sha1:USWZWZNSVCOSXA2MBSHMRAVZY7R2ZQSY
WARC-Block-Digest: sha1:NNBOIVAWIP63IESE3JCE36X6BRE2YYZG
Content-Type: application/http; msgtype=response
Content-Length: 19556
HTTP/1.0 200 OK
server: Apache-Coyote/1.1
content-type: text/html;charset=utf-8
content-length: 35917
date: Sat, 08 May 2021 16:04:56 GMT
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
The page to be crawled is a seed or original resource.
An observation of that original resource at a specific time is a
memento.
We use the term URI-M to denote a memento URI.
The datetime of a memento’s capture is its memento-datetime.
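As a rough illustration of how the terms above relate to the record, the sketch below reads the WARC header fields with the Python standard library only. It assumes the archive exposes WARC-Date as the memento-datetime and WARC-Target-URI as the original resource; the function name `parse_warc_headers` and the abbreviated sample are illustrative, not part of any tool named in these slides.

```python
from datetime import datetime, timezone


def parse_warc_headers(record_text):
    """Return (version, fields) for the WARC header block that starts a record."""
    lines = record_text.strip().splitlines()
    version = lines[0].strip()            # e.g. "WARC/1.0"
    fields = {}
    for line in lines[1:]:
        if not line.strip():              # a blank line ends the WARC header block
            break
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


# Abbreviated copy of the record shown above; the blank line separates
# the WARC headers from the captured HTTP response.
SAMPLE_RECORD = """WARC/1.0
WARC-Date: 2016-04-01T18:08:53Z
WARC-Type: response
WARC-Target-URI: http://philadelphia.feb.gov/

HTTP/1.0 200 OK
content-type: text/html;charset=utf-8
"""

version, fields = parse_warc_headers(SAMPLE_RECORD)
original_resource = fields["WARC-Target-URI"]
# Assumption: the capture time in WARC-Date is what the archive reports as memento-datetime.
memento_datetime = datetime.strptime(
    fields["WARC-Date"], "%Y-%m-%dT%H:%M:%SZ"
).replace(tzinfo=timezone.utc)
```

A production reader would use a dedicated WARC library and handle compressed, multi-record files; this sketch only shows where the two values come from.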

A TimeMap gives us a listing of the
mementos available for an original resource
<http://www.cs.odu.edu>;rel="original",
<https://web.archive.org/web/19970102130137/http://cs.odu.edu:80/>;rel="memento"; datetime="Thu, 02 Jan 1997 13:01:37 GMT",
<https://web.archive.org/web/19970606105039/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Fri, 06 Jun 1997 10:50:39 GMT",
<http://archive.md/19970606105039/http://www.cs.odu.edu/>;rel="memento"; datetime="Fri, 06 Jun 1997 10:50:39 GMT",
<https://web.archive.org/web/19971010201632/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Fri, 10 Oct 1997 20:16:32 GMT",
<https://web.archive.org/web/19971211124211/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Thu, 11 Dec 1997 12:42:11 GMT",
...
<https://web.archive.org/web/19990502033600/http://cs.odu.edu:80/>;rel="memento"; datetime="Sun, 02 May 1999 03:36:00 GMT",
...
<https://arquivo.pt/wayback/20091223043049mp_/http://www.cs.odu.edu/>;rel="memento"; datetime="Wed, 23 Dec 2009 04:30:49 GMT",
...
Callouts in the figure: the original resource (“now”), a memento from 1997, a memento from 1999, and a memento from 2009.
H. Van de Sompel, M. Nelson, and R. Sanderson, “RFC 7089 – HTTP Framework for Time-Based
Access to Resource States -- Memento,” http://www.rfc-editor.org/info/rfc7089, 2013.
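To make the TimeMap listing concrete, here is a minimal sketch that pulls the original resource and the mementos out of link-format text like the example above. The regular expression and the name `parse_timemap` are illustrative assumptions; it expects each entry's datetime attribute to sit on one line and ignores other relation types and attributes.

```python
import re
from email.utils import parsedate_to_datetime

# One link-format entry: <URI>; rel="..." with an optional datetime="..."
ENTRY = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"(?:\s*;\s*datetime="([^"]+)")?')


def parse_timemap(text):
    """Return (original URI, list of (memento-datetime, URI-M)) sorted by time."""
    original = None
    mementos = []
    for uri, rel, dt in ENTRY.findall(text):
        relations = rel.split()           # rel may hold several values, e.g. "first memento"
        if "original" in relations:
            original = uri
        elif "memento" in relations and dt:
            mementos.append((parsedate_to_datetime(dt), uri))
    return original, sorted(mementos)


SAMPLE_TIMEMAP = '''<http://www.cs.odu.edu>;rel="original",
<https://web.archive.org/web/19970102130137/http://cs.odu.edu:80/>;rel="memento"; datetime="Thu, 02 Jan 1997 13:01:37 GMT",
<https://web.archive.org/web/19990502033600/http://cs.odu.edu:80/>;rel="memento"; datetime="Sun, 02 May 1999 03:36:00 GMT",
<https://arquivo.pt/wayback/20091223043049mp_/http://www.cs.odu.edu/>;rel="memento"; datetime="Wed, 23 Dec 2009 04:30:49 GMT"'''

original, mementos = parse_timemap(SAMPLE_TIMEMAP)
```

Each tuple pairs a memento-datetime with its URI-M, which is the information a visitor needs before viewing a specific memento.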

Others have tackled portions of the problem of summarizing
web archives, but only AlNoamany addressed all processes
Some have conflated our
steps of generating
metadata and visualizing it.
Many have and continue to
focus on selecting exemplar
words, sentences, images,
video clips, and more for
summarization.
Those who have evaluated
surrogates in the past focused
on whether the participant chose the
correct search engine result,
not on understanding.
Attempts to manually apply metadata
to these collections are impacted by
the scale of the problem.

AlNoamany identified the characteristics of social
media stories and Archive-It collections
By analyzing the characteristics
of stories and collections, she
determined that popular
stories contain 28 elements.
Our model maps to hers but
expands her visualize step.
AlNoamany’s
sieve diagram
gives us one
solution for
storytelling. We will
explore others.
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From
Archived Collections,” in ACM Web Science, pp. 309–318, 2017.
https://doi.org/10.1145/3091478.3091508.

AlNoamany extracted some story metadata and relied on
Storify to create and distribute the resulting visualization.

Her proof-of-concept generated some document
metadata and relied on Storify to generate the rest.
[Figure: the five-process model annotated with the parts handled by AlNoamany’s Proof-of-Concept (POC) and by Storify. Both POC and Storify generated portions of document metadata.]

She generated many different stories based on
exemplars selected by her proof-of-concept

Through a user study, she demonstrated that participants
could tell the difference between her solution’s stories and
randomly generated stories
Participants could not tell the
difference between her
solution’s stories and those
generated by human archivists

Unfortunately, her solution is difficult to generalize
Adobe shut down the Storify platform in 2018.
AlNoamany’s POC focused on Archive-It.

Selecting Exemplars and Generating Story Metadata

As collection users, what structural features
can we view from outside?
• Using only structural features is advantageous because it saves one from having to download a collection’s content.
• These structural features give us different insight than can be provided by text analysis or metadata.
81,014 seeds
486,227 seed mementos
Structural features shown here:
• number of seeds
• number of mementos
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P

Was the collection built from web sites belonging
to one domain or many?
Many domains One domain
Structural feature
discussed here:
• domain diversity

Were most of the web pages in the collection top-level
pages or specific articles deeper in a web site?
Top-level pages Deeper links
Structural feature
discussed here:
• path depth diversity
• most frequent path
depth
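The seed-level features on the last few slides can be computed from seed URIs alone, without downloading content. The sketch below is one simple reading of them, and its ratios are assumptions for illustration: diversity here is "distinct values divided by number of seeds", which may differ from the definitions in the cited paper. The seed URIs are made up.

```python
from collections import Counter
from urllib.parse import urlparse


def path_depth(uri):
    """Count non-empty path segments; a top-level page has depth 0."""
    return len([segment for segment in urlparse(uri).path.split("/") if segment])


def structural_features(seed_uris):
    """Summarize a list of seed URIs without fetching any of them."""
    domains = {urlparse(uri).hostname for uri in seed_uris}
    depths = Counter(path_depth(uri) for uri in seed_uris)
    count = len(seed_uris)
    return {
        "seed_count": count,
        # Assumed ratio: near 1/count means one domain, 1.0 means every seed differs.
        "domain_diversity": len(domains) / count,
        "path_depth_diversity": len(depths) / count,
        "most_frequent_path_depth": depths.most_common(1)[0][0],
    }


# Hypothetical seeds, for illustration only.
seeds = [
    "http://example.com/",
    "http://example.com/news/2016/flood.html",
    "http://example.org/a/b/c",
    "http://example.net/",
    "http://example.net/x/y/z",
]
features = structural_features(seeds)
```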

Growth curves provide some understanding of
collection curation behavior
• Skew of the collection’s holdings (positive or negative)
  • Indicates temporality of collection
• Skew of the curatorial involvement with the collection (positive or negative)
  • When seeds were added
  • When interest was lost or regained
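A growth curve of this kind can be sketched as the cumulative share of items plotted against the share of the collection's lifespan that has elapsed. The function below is a minimal sketch under that assumption; the dates are invented, and whether the input is "when each seed was added" or "when each memento was captured" decides which of the two curves is produced.

```python
from datetime import datetime


def growth_curve(datetimes):
    """Return (fraction of lifespan elapsed, fraction of items present) points."""
    times = sorted(datetimes)
    start, end = times[0], times[-1]
    lifespan = (end - start).total_seconds() or 1.0   # avoid dividing by zero
    total = len(times)
    return [
        ((moment - start).total_seconds() / lifespan, (index + 1) / total)
        for index, moment in enumerate(times)
    ]


# Hypothetical dates: three items arrive early, one at the very end.
added = [
    datetime(2016, 1, 1),
    datetime(2016, 1, 11),
    datetime(2016, 1, 21),
    datetime(2016, 4, 10),
]
curve = growth_curve(added)
```

A curve that rises steeply near the left edge, as this one does, suggests most of the curatorial activity happened early in the collection's life.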

We discovered four semantic categories in a study of 3,382 Archive-It collections
• Self-Archiving: 54.1% of collections
• Subject-based: 27.6% of collections
• Time Bounded – Expected: 14.1% of collections
• Time Bounded – Spontaneous: 4.2% of collections (some evaluated by AlNoamany)

Self-Archiving collections dominate Archive-It
Self-Archiving (54.1%): organizations archiving themselves or those they are responsible for.
Other categories: Subject-based (27.6%), Time Bounded – Expected (14.1%), Time Bounded – Spontaneous (4.2%).

Subject-based collections come in second
Subject-based (27.6%): collections centered on a subject that is not ephemeral.
Other categories: Self-Archiving (54.1%), Time Bounded – Expected (14.1%), Time Bounded – Spontaneous (4.2%).

Time Bounded – Expected collections summarize events
we anticipate
Time Bounded – Expected (14.1%): collections about an anticipated event.
Other categories: Self-Archiving (54.1%), Subject-based (27.6%), Time Bounded – Spontaneous (4.2%).

Time Bounded – Spontaneous collections summarize
unexpected events
Time Bounded – Spontaneous (4.2%): collections about an unexpected event. Some of these were evaluated by AlNoamany.
Other categories: Self-Archiving (54.1%), Subject-based (27.6%), Time Bounded – Expected (14.1%).

We can bridge the structural to the
descriptive…
Using the structural features mentioned previously, we can predict these
semantic categories with a Random Forest classifier with F1 = 0.720

RQ1: What types of web archive collections exist
and what structural features do they have?
Type | % of Archive-It Collections | Description | Example Collection
Self-Archiving | 54.1% | an organization archiving itself | University of Utah Web Archive
Subject-based | 27.6% | seeds bound by single topic | Environmental Justice
Time Bounded – Expected | 14.1% | an expected event or time period | 2008 Olympics
Time Bounded – Spontaneous | 4.2% | unexpected event | Tucson Shootings
Based on a manual review of 3,382 Archive-It
collections, we classified them into 4 types.
Growth curves give us some idea
of the curatorial involvement with
a collection over time.
When selecting exemplars, we
need to summarize the collection
in terms of time and topic. The
shapes of these growth curves
indicate how we might cluster in
time.
This example growth curve
shows us that 30% of the
seeds were added early in
the collection’s life.
Structurally, for seeds, we can study the:
• distribution of domains
• distribution of path depths
• most frequent path depth
• query string usage

Identifying off-topic mementos is key to choosing
exemplar mementos
Collections have a topic.
Seeds are selected to support that topic.
Mementos are observations of seeds.
Some of these versions are off-topic.
Excluding these off-topic mementos from consideration is key to selecting exemplars.
Examples of off-topic mementos shown: Hacked; Moved on from topic; Web Page Gone; Account Suspension.
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87

We found that Word Count had the best F1
score for identifying off-topic mementos
We reused AlNoamany’s
labeled dataset.
She did not try:
• Sorensen-Dice
• Simhash of raw
content
• Simhash of TF
• Gensim LSI
Our word count
accuracy came out
ahead of AlNoamany’s.
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within TimeMaps in Web
archives,” International Journal on Digital Libraries, 2016. https://doi.org/10.1007/s00799-016-0183-5
Y. AlNoamany and S. M. Jones, “Off-Topic Gold Standard Dataset,” GitHub. 2018.
https://github.com/oduwsdl/offtopic-goldstandard-data
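A word count measure can be sketched in a few lines. The version below assumes each memento is compared with the first memento in its TimeMap and is flagged when its word count drops by more than some fraction; the threshold of -0.70 and the sample texts are hypothetical values for illustration, not the settings evaluated in this work.

```python
import re


def word_count(text):
    """Count word-like tokens in extracted page text."""
    return len(re.findall(r"\w+", text))


def word_count_change(first_text, later_text):
    """Relative change in word count versus the first memento (negative means it shrank)."""
    base = word_count(first_text)
    if base == 0:
        return 0.0
    return (word_count(later_text) - base) / base


def flag_off_topic(memento_texts, threshold=-0.70):
    """Flag mementos whose word count fell below the (hypothetical) threshold."""
    first = memento_texts[0]
    return [word_count_change(first, text) < threshold for text in memento_texts]


# Invented texts for three mementos of one seed, in capture order.
texts = [
    "flood waters rise across south louisiana as rescue crews respond",
    "flood waters recede across south louisiana as residents return home",
    "account suspended",
]
flags = flag_off_topic(texts)
```

The third text mimics the "Account Suspension" case: the page still resolves, but almost all of its words are gone.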

Filtering off-topic mementos is just one step in a set of
algorithmic primitives for selecting exemplars
We can filter the collection to get a
good set of exemplars and then
randomly sample from the remainder.

Ordering allows us to create meaning from a
list of mementos
We can order the collection by some
feature and then systematically sample
every jth memento from the remainder.
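The two combinations described so far (filter then randomly sample, order then systematically sample) can be sketched as plain functions over a list of mementos. The record fields (`urim`, `memento_datetime`, `off_topic`) and function names are hypothetical; this is not the interface of the tools discussed in this section.

```python
import random


def filter_then_random_sample(mementos, keep, k, seed=0):
    """Filter with a predicate, then draw k mementos at random from the remainder."""
    remainder = [memento for memento in mementos if keep(memento)]
    rng = random.Random(seed)             # fixed seed so the sketch is repeatable
    return rng.sample(remainder, min(k, len(remainder)))


def order_then_systematic_sample(mementos, key, j):
    """Order by some feature, then keep every jth memento starting with the first."""
    ordered = sorted(mementos, key=key)
    return ordered[::j]


# Hypothetical records; ISO 8601 strings sort chronologically.
collection = [
    {"urim": "m3", "memento_datetime": "2013-04-17T00:00:00Z", "off_topic": False},
    {"urim": "m1", "memento_datetime": "2013-04-15T00:00:00Z", "off_topic": False},
    {"urim": "m5", "memento_datetime": "2013-04-19T00:00:00Z", "off_topic": True},
    {"urim": "m2", "memento_datetime": "2013-04-16T00:00:00Z", "off_topic": False},
    {"urim": "m4", "memento_datetime": "2013-04-18T00:00:00Z", "off_topic": False},
]

random_pick = filter_then_random_sample(collection, lambda m: not m["off_topic"], k=2)
every_second = order_then_systematic_sample(
    collection, key=lambda m: m["memento_datetime"], j=2
)
```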

Scoring gives us an idea of how well a memento meets
the information needs represented by a function
[Figure: a pipeline from a web archive collection to exemplars. Steps and their intentions: filter (include only mementos containing a given pattern; reduces the number of pages to consider), score (score results with BM25; scores results), order (order by descending score; orders results).]
We can combine filter, score, and order
to create a simple search engine.
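The filter, score, and order combination can be sketched as a tiny search function. The BM25 implementation below is a compact textbook variant written for illustration (parameters k1 = 1.5 and b = 0.75 and whitespace tokenization are assumptions); the collection contents and URI-M labels are invented.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    docs = [tokenize(document) for document in documents]
    if not docs:
        return []
    count = len(docs)
    average_length = sum(len(doc) for doc in docs) / count
    document_frequency = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_frequency = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_frequency:
                continue
            df = document_frequency[term]
            idf = math.log(1 + (count - df + 0.5) / (df + 0.5))
            tf = term_frequency[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / average_length))
        scores.append(score)
    return scores


def search(collection, pattern, query):
    """filter: keep texts containing pattern; score: BM25; order: descending score."""
    kept = {urim: text for urim, text in collection.items() if pattern in text}
    urims = list(kept)
    scores = bm25_scores(query, [kept[urim] for urim in urims])
    return sorted(zip(urims, scores), key=lambda pair: pair[1], reverse=True)


# Invented memento texts keyed by placeholder URI-M labels.
collection = {
    "urim-a": "Boston marathon explosions injure many near the finish line",
    "urim-b": "Boston weather forecast for the marathon weekend",
    "urim-c": "account suspended",
}
results = search(collection, "Boston", "marathon explosions")
```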

Clustering based on a feature allows us to
imbue subsets of mementos with meaning
With these primitives, we can
reproduce AlNoamany’s algorithm,
which we will now call DSA1.

These primitives allow us to create other algorithms for
selecting exemplars that tell the story the user desires
DSA2 focuses on representing collection
growth curves and scoring mementos
by their surrogate metadata.
DSA3 focuses on mementos that best
match the collection topic.
DSA4 focuses on finding the most novel
mementos in the collection.

Search engines are the de-facto method of exploring
collections; if we consider them a baseline, then how
retrievable are the exemplars produced by DSA algorithms?
We loaded 8
different Archive-It
collections into
different instances
of the SolrWayback
web archive
search engine.
We also executed
4 different DSA
algorithms to
produce exemplars
from these
collections.

We then generated queries with four different methods based on
the content of the exemplars produced by each DSA algorithm

We visualized the percentage of exemplars
that were never retrieved by any query
x-axis: the number of search results to review before we find the exemplar
y-axis: the percentage of exemplars that have zero retrievability
In this graph, we are reporting zero retrievability with:
• queries from doc2query-T5
• for exemplars chosen by DSA3
At 10 search results, 57.82% of the exemplars were not retrieved.
After 1000 search results, 36.05% of the exemplars were not retrieved.
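The quantity on the y-axis can be sketched as follows: for a given cutoff, an exemplar has zero retrievability if no query returns it within the first `cutoff` results. The function name, identifiers, and rankings are invented for illustration and are not the experiment's data.

```python
def zero_retrievability(exemplars, ranked_results_per_query, cutoff):
    """Percentage of exemplars that no query retrieves within the top `cutoff` results."""
    retrieved = set()
    for ranking in ranked_results_per_query:
        retrieved.update(ranking[:cutoff])
    missing = [exemplar for exemplar in exemplars if exemplar not in retrieved]
    return 100.0 * len(missing) / len(exemplars)


# Hypothetical exemplars and two ranked result lists, one per query.
exemplars = ["m1", "m2", "m3", "m4"]
rankings = [
    ["m1", "x", "m2"],
    ["y", "z", "m3"],
]
at_1 = zero_retrievability(exemplars, rankings, cutoff=1)
at_3 = zero_retrievability(exemplars, rankings, cutoff=3)
```

As the cutoff grows, the percentage can only fall or stay level, which is why the curves in the graph decrease from left to right.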

For all query methods the DSA algorithms’ exemplars
have similar retrievability
If all pages are relevant, then DSA algorithms produce mementos with more novelty than standard query
methods can with a state-of-the-art web archive search engine.
DSA4 was designed to surface more novel mementos and meets its goal in these results.

RQ2: Which approaches work best for selecting
exemplars from web archive collections?
We established that four different
algorithms built from these
primitives select exemplars
that were not retrievable using
standard query methods and a
state-of-the-art web archive
search engine.
Removing off-topic mementos is but one step
toward selecting exemplars.
We devised a set of primitives for creating
many different types of sampling algorithms
that consider structural features.
An important step in
selecting exemplars to
summarize the collection is
identifying off-topic
mementos. We found that
word count differences
work best.

We implemented these primitives as part of Hypercane
Hypercane was used to
conduct the experiments in
this section.
S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson. 2021. Hypercane: Intelligent Sampling for Web
Archive Collections. In ACM/IEEE JCDL 2021. [to be published in September 2021]

Generating Document Metadata

We evaluated 55 platforms in 2017 and found that existing social
platforms do not reliably produce surrogates for mementos
If we cannot rely upon the service to generate a surrogate, we will have to create
our own.
Which surrogate works best for understanding web archive collections?
S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “MementoEmbed and Raintale
for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020.
https://arxiv.org/abs/2008.00137

We reused exemplars that archivists had selected to
describe their own collections to create stories with
different surrogates...

Archive-It like surrogates visualize these
mementos as they are on Archive-It
Archive-It like surrogate
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
This is a screenshot of a subset of
the exemplars selected by the
archivists of the Archive-It
collection
Egypt Politics and Revolution.

Browser thumbnails are screenshots of the page in a
browser
Browser thumbnails
Browser thumbnails
are a popular
surrogate type used
at web archives.
This is a screenshot of the exemplars
selected by the archivists of the
Archive-It collection

Social cards come from social media
platforms
Social cards
Social cards are a type
of surrogate typically
found on social media
platforms like
Facebook or Twitter.
These social cards were
specially designed to
include information
from web archives.

sc/t combines social cards and thumbnails
sc/t
We replaced the
striking image of the
social card with a
browser thumbnail.

sc+t places the social card to the left and a thumbnail
to the right
sc+t
Our thought was that
more information was
better.

sc^t is interactive
sc^t
When a user hovers
over the striking image
the browser thumbnail
appears.
This provides both types
of surrogates in a smaller
space.

We then presented these stories to Mechanical Turk (MT)
participants
Archive-It like
Social Card
Browser thumbnails
Social Card With
Thumbnail as Image (sc/t)
Social Card
With Thumbnail
to Right (sc+t)
Social Card with
Thumbnail on
Hover (sc^t)
• 4 stories of 15-17 URI-Ms selected by human
Archive-It curators from their collections
• 6 different surrogate types
• 24 different story-surrogate combinations
• 120 MT participants
• Given 30 seconds to view each story

And then asked them which of the following come from
the same collection…
• Each participant was shown a list of 6 surrogates of the same type as the story they just viewed.
• They were asked to choose the 2 that they thought came from the same collection.
• They were given as much time as they wished to answer the question.
• This process is like the Sentence Verification Task used in reading comprehension studies.

Social cards probably outperform the Archive-It
surrogate for participants’ correct answers
[Figure: Correct Answers Per Surrogate (median and mean, scale 0 to 2.5) for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t; annotated with p = 0.0569 and p = 0.0770.]
Social cards produced less interaction while participants
viewed their stories
We measured clicks and hovers by participants while they were viewing their stories.
For browser thumbnails alone, most of the participants clicked the link to view the actual
memento behind the surrogate.

RQ3: What surrogates work best for
understanding groups of mementos?
Correct answers per surrogate indicate that social cards
probably outperform the Archive-It surrogate
From a user study with 120 Mechanical Turk participants:
• 4 stories of 15-17 mementos selected by human curators from their own collections
• 6 different surrogate types
• 24 different story-surrogate combinations
• Each given 30 seconds to view a story, then asked a question
With social cards, users were able to correctly answer our questions without as much interaction.

Social cards are generated based on the
HTML metadata that authors provide
Card title: og:title -or- twitter:title -or- <title>
Card description: og:description -or- twitter:description -or- description
Card image: og:image -or- twitter:image
Without twitter:card and og:title or twitter:title, Twitter gives up and does not generate a card.
Facebook parses the <title> and produces a card with just a title.
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. 2021. Automatically Selecting Striking Images
for Social Cards. In ACM WebSci ‘21. https://arxiv.org/pdf/2103.04899.
What do we do if this
metadata does not exist?
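The fallback order listed above can be sketched with the standard library's HTML parser. The class and function names below are illustrative, and the fallback order follows the list on this slide rather than the exact behavior of any one platform; fields with no metadata come back as None, which is the case the next slides address.

```python
from html.parser import HTMLParser


class CardMetadataParser(HTMLParser):
    """Collect <meta> name/property values and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            key = attributes.get("property") or attributes.get("name")
            if key and attributes.get("content"):
                self.meta.setdefault(key.lower(), attributes["content"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data


def first_present(meta, *keys):
    """Return the first non-empty value among the given metadata keys."""
    for key in keys:
        if meta.get(key):
            return meta[key]
    return None


def card_fields(html):
    parser = CardMetadataParser()
    parser.feed(html)
    title_text = parser.title.strip() if parser.title else None
    return {
        "title": first_present(parser.meta, "og:title", "twitter:title") or title_text,
        "description": first_present(
            parser.meta, "og:description", "twitter:description", "description"
        ),
        "image": first_present(parser.meta, "og:image", "twitter:image"),
    }


# Hypothetical page: no og:title, so the <title> element is the fallback.
PAGE = (
    "<html><head><title>Flood coverage</title>"
    '<meta name="description" content="Plain description">'
    '<meta property="og:image" content="http://example.com/a.jpg" />'
    "</head><body></body></html>"
)
card = card_fields(PAGE)
bare = card_fields("<html><head></head><body>No metadata here</body></html>")
```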

We analyzed 277,724 news articles captured by the Internet
Archive from 1998 to 2016, and found different rates
of metadata adoption
OGP = Open Graph Protocol
Facebook Cards
150 billion
documents in the
Internet Archive
were captured
before 2010 and
thus have no card
metadata
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “Automatically Selecting Striking Images for Social
Cards,” In ACM Web Science Conference, 2021. https://doi.org/10.1145/3447535.3462505.
S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “It's All About The Cards: Sharing
on Social Media Probably Encouraged HTML Metadata Growth,” To be in ACM/IEEE Joint Conference
on Digital Libraries, 2021. [preprint: https://arxiv.org/abs/2104.04116.]

By applying author behavior, we can
generate descriptions
We used the existing field values, written by page authors, as
ground truth data.
It tells us that authors tend to write card descriptions that have
the following lengths:
• 268 characters
• 52 words
• 2 sentences
We can use this length as input to automatic text summarization
algorithms.
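To show how such a length budget might be passed to a summarizer, the sketch below uses a deliberately simple stand-in: it keeps leading sentences until either the sentence budget or the character budget is reached. This lead-sentence baseline is an assumption chosen for brevity and is not the summarization algorithm evaluated in this work; only the default budgets (2 sentences, 268 characters) come from the slide.

```python
import re


def lead_summary(text, max_sentences=2, max_chars=268):
    """Keep leading sentences while staying within both length budgets."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    summary = ""
    for sentence in sentences[:max_sentences]:
        candidate = (summary + " " + sentence).strip()
        if len(candidate) > max_chars:
            break
        summary = candidate
    return summary


# Invented article text.
ARTICLE = (
    "Flood waters rose overnight. Thousands were evacuated. "
    "Recovery will take months."
)
description = lead_summary(ARTICLE)
short_description = lead_summary(ARTICLE, max_chars=30)
```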

If no metadata
exists, we can
select a striking
image from the
images
available in the
document
Which of the images
outlined in red is the
striking one chosen by the
author?
How would a machine
know which one to choose
if there were no striking
image specified in the
metadata?

Our generic image selection
approach has 3 steps
1. Score each image in the
document by some
approach (e.g., ML
probability, feature
value)
2. Sort the list of images by
descending score (e.g.,
highest ML probability is
first, image with most
colors is first)
3. Choose the image at the
beginning of the list
(highest scoring)
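The three steps above translate directly into code. In this sketch the scoring function is pluggable, so the same routine works with a feature value (here, color count) or a classifier probability; the image URIs are placeholders and the color counts are taken from the example on this slide.

```python
def select_striking_image(images, score):
    """Score each image, sort by descending score, and choose the first."""
    scored = [(score(image), image) for image in images]                  # step 1
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)       # step 2
    return ranked[0][1] if ranked else None                               # step 3


# Placeholder URIs; color counts as shown in the example.
images = [
    {"uri": "http://example.com/logo.png", "color_count": 3816},
    {"uri": "http://example.com/photo-large.jpg", "color_count": 154131},
    {"uri": "http://example.com/photo-resized.jpg", "color_count": 48020},
    {"uri": "http://example.com/photo-cropped.jpg", "color_count": 44737},
    {"uri": "http://example.com/banner.jpg", "color_count": 30940},
]

striking = select_striking_image(images, score=lambda image: image["color_count"])
```

Swapping in a different `score` callable, such as one that returns a model's predicted probability, changes the ranking without touching the selection logic.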
[Figure: example images sorted by color count (154,131; 48,020; 44,737; 30,940; and 3,816 colors) and sorted by classifier probability (0.3623, 0.1948, 0.1259, 0.1116, 0.11); some images are labeled as resized, cropped, or larger versions.]

We visualized how well different approaches performed at
choosing a striking image that was perceptually the same as
the author’s
The best
approach
starts here
As we proceed to the right, we
accept more images as
perceptually equal to the one
selected by the approach
All lines converge
as any image
becomes
acceptable as
correct
Higher scores
indicate more
accurate
answers
Remember:
we are trying to find the
approach that best selects
the striking image chosen
by the author
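S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “Automatically Selecting Striking Images for Social
Cards,” In ACM Web Science Conference, 2021. https://doi.org/10.1145/3447535.3462505.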

We found that Random Forest performed best with base image
features quickly calculated via standard image libraries
97
P@1=0.831
MRR=0.883
base image features:
• byte size
• width in pixels
• height in pixels
• negative space
(# of histogram cols = 0)
• size in pixels
• aspect ratio
• number of colors
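For readers unfamiliar with the two reported metrics, here is a small sketch of how P@1 and MRR are computed from ranked lists. It assumes exact matching between the top-ranked image and the author's choice, whereas the evaluation described in these slides accepted perceptually equal images; the page and image names are hypothetical.

```python
def precision_at_1(rankings, correct):
    """Fraction of pages whose top-ranked image is the author's image."""
    hits = sum(1 for page, ranked in rankings.items()
               if ranked and ranked[0] == correct[page])
    return hits / len(rankings)

def mean_reciprocal_rank(rankings, correct):
    """Average of 1/rank of the author's image (0 when it is absent)."""
    total = 0.0
    for page, ranked in rankings.items():
        if correct[page] in ranked:
            total += 1.0 / (ranked.index(correct[page]) + 1)
    return total / len(rankings)

# Hypothetical rankings produced by some scoring approach.
rankings = {
    "page1": ["a", "b", "c"],
    "page2": ["x", "y"],
    "page3": ["m", "n", "o"],
    "page4": ["p", "q"],
}
# Hypothetical author-chosen striking images.
correct = {"page1": "a", "page2": "y", "page3": "o", "page4": "p"}

p_at_1 = precision_at_1(rankings, correct)
mrr = mean_reciprocal_rank(rankings, correct)
print(p_at_1, round(mrr, 3))
```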

RQ4: What methods that automate the creation of surrogates
produce results that best match humans' behavior?
98
Authors write card descriptions that
are 268 characters, 52 words, or 2
sentences long. We can use this
length as input to automatic text
summarization algorithms, like
TextRank.
With base image features Random Forest performed best
for choosing the same striking image as the author.
We analyzed the metadata
usage of news article mementos
over time.
Metadata fields associated with
cards had astronomical growth.

We implemented these results as part of
MementoEmbed
99
• Cards
• Browser Thumbnails
• Imagereels
• Word Clouds
As an archive-aware surrogate service, MementoEmbed provides different types of surrogates for mementos.
It also has an extensive API for generating document metadata.
S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale
for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020.

Because Storify was gone, we created
Raintale for visualizing and distributing stories
101
Visualizing And Distributing Stories
S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive
Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
Storify provided an API, allowing us to configure
the look and feel of our story.
With this functionality gone, we created
Raintale, a platform-agnostic storytelling tool
that generates files or social media posts.

Remember, Elbert wants to promote his collections for
others, and he uses the DSA Toolkit to do so
102
Today he is
promoting a
collection about
COVID-19.
From this: 23,376 mementos
To this: a sample of 36 mementos visualized as social cards, phrases, and images

Elbert applies all processes of our storytelling model
103
• Generate Story Metadata
• Select Exemplars
• Generate Document Metadata
• Visualize The Story
• Distribute The Story

Remember, Natasha needs to compare
collections to each other
104
Today she is reviewing different collections about shootings:
Virginia Tech, El Paso, and Norway.

Ling inherited a collection and needs to
know what it contains
105
Ling can apply our
processes with a
different template
to include other
information, like
structural features.
From this: 88,755 mementos and no metadata
To this: 50 exemplars, structural features, metadata analysis, growth curves, and more

Rustam wants to see how a page changed
over time
106
• Generate Story Metadata
• Select Exemplars
• Generate Document Metadata
• Visualize The Story
• Distribute The Story
Rustam uses
Hypercane to help
him choose a page
and then view its
change over time.

Rustam chooses one of Raintale’s default templates
because he is using the DSA Toolkit for exploration
107
Rustam’s story
seems plain, but he
is really interested in
the changing text
over time.

Olayinka wants to see what different news sources said
on the same day in different years
108
With our SHARI
process, she can
compare different
years to each other
2018: US Elections
2019: Mass shootings in El Paso and Dayton
2020: COVID-19
S. M. Jones, A. C. Nwala, M. C. Weigle, M. Klein, and M. L. Nelson, “SHARI -- An Integration of Tools to
Visualize the Story of the Day,” In Web Archiving and Digital Libraries Workshop, 2020.

Olayinka can look through the stories produced by our
SHARI process to perform her comparisons
109
Our process is not limited to our implementation.
It also allows us to incorporate input
from other systems, like StoryGraph.
• Generate Document Metadata
• Visualize The Story
• Distribute The Story
• Select Exemplars
• Generate Story Metadata

We presented a model for storytelling with web archives
111
Contributions

We established a vocabulary for different
types and structural features of collections
112
Type | % of Archive-It Collections | Description | Example Collection
Self-Archiving | 54.1% | an organization archiving itself | University of Utah Web Archive
Subject-based | 27.6% | seeds bound by single topic | Environmental Justice
Time Bounded – Expected | 14.1% | an expected event or time period | 2008 Olympics
Time Bounded – Spontaneous | 4.2% | unexpected event | Tucson Shootings
Based on a manual review of 3,382 Archive-It
collections, we classified them into 4 types.
Growth curves give us
some idea of the curatorial
involvement with a
collection over time.
Structurally, for seeds, we can study the:
• distribution of domains
• distribution of path depths
• most frequent path depth
• query string usage
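A rough sketch of how these seed features could be computed with Python's standard library follows. The definition of path depth used here (the number of non-empty path segments) and the function names are assumptions for illustration; the dissertation may define them differently.

```python
from collections import Counter
from urllib.parse import urlparse

def path_depth(uri):
    """Assumed definition: number of non-empty segments in the URI path."""
    return len([segment for segment in urlparse(uri).path.split("/") if segment])

def seed_features(seed_uris):
    """Summarize domains, path depths, and query string usage for a list of seeds."""
    domains = Counter(urlparse(u).netloc.lower() for u in seed_uris)
    depths = Counter(path_depth(u) for u in seed_uris)
    with_query = sum(1 for u in seed_uris if urlparse(u).query)
    return {
        "domain_distribution": dict(domains),
        "path_depth_distribution": dict(depths),
        "most_frequent_path_depth": depths.most_common(1)[0][0],
        "query_string_fraction": with_query / len(seed_uris),
    }

# Hypothetical seeds for illustration.
seeds = [
    "http://example.org/",
    "http://example.org/news/2011/story.html",
    "http://example.com/blog/post?id=7",
    "http://example.com/about/team",
]
features = seed_features(seeds)
print(features)
```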
iPres 2018
Contributions

Word count is a fast, effective intra-TimeMap
method of identifying off-topic mementos
113
iPres 2018
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Technical problems
Page gone
Hacking
Moving on from topic
Contributions

We devised a set of primitives for intelligently selecting
exemplars from web archive collections
114
Contributions

Hypercane implements our primitives for
selecting exemplars
115
ACM/IEEE
JCDL 2021
ACM SIGWEB
Newsletter 2021
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. Hypercane: Intelligent Sampling for Web Archive
Collections. In ACM/IEEE Joint Conference on Digital Libraries, 2021. [To be published in 2021]
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. Hypercane: Toolkit for Summarizing Large
Collections of Archived Pages. In SIGWEB Newsletter Autumn, 2021. [To be published in 2021]
Contributions

We created four different algorithms from these
primitives and found that they produce exemplars with
low retrievability with a state-of-the-art search engine
116
We applied four different
query methods to the
mementos surfaced by
these algorithms.
As designed, our DSA4
algorithm surfaced more
novel exemplars than
those discoverable via
the search engine.
We measured mean
retrievability and zero
retrievability to determine
how easy a document
was to retrieve with the
given query method.
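To make the two measures concrete, here is a sketch under stated assumptions: a document's retrievability is taken as the number of queries that return it at or above a rank cutoff, mean retrievability is the average of those counts, and zero retrievability is the fraction of documents never returned. The slides do not give the exact formulation or cutoff, so these definitions, the cutoff, and the identifiers are illustrative.

```python
def retrievability(documents, results_per_query, cutoff=10):
    """Count, for each document, the queries that return it within the cutoff."""
    scores = {doc: 0 for doc in documents}
    for ranked in results_per_query:
        for doc in ranked[:cutoff]:
            if doc in scores:
                scores[doc] += 1
    return scores

def mean_retrievability(scores):
    """Average retrievability across all documents."""
    return sum(scores.values()) / len(scores)

def zero_retrievability(scores):
    """Fraction of documents that no query returned within the cutoff."""
    return sum(1 for value in scores.values() if value == 0) / len(scores)

# Hypothetical exemplars and ranked search results for three queries.
exemplars = ["m1", "m2", "m3", "m4"]
results = [["m1", "m2"], ["m1", "m9"], ["m2", "m1", "m3"]]

scores = retrievability(exemplars, results, cutoff=2)
print(scores, mean_retrievability(scores), zero_retrievability(scores))
```

Under these assumptions, exemplars that a search engine rarely surfaces have low mean retrievability and high zero retrievability.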
Contributions

Our user study provides engineers support for
choosing social cards over other surrogate types
117
From our user study, correct answers per
surrogate indicate that social cards
probably outperform the Archive-It
surrogate
With social cards, users were able to
correctly answer our questions without
as much interaction.
ACM CIKM 2019
Web Archive Collections,” In ACM International Conference on Information and Knowledge
Management, 2019. https://doi.org/10.1145/3357384.3358039.
Contributions

We established methods for generating the metadata
for social cards if it does not exist
118
ACM Web Science
2021
For choosing striking
images, we trained
classifiers using base image
features (e.g., pixel size,
color count) to choose the
same striking image that
web page authors chose.
Random Forest with these
base image features
performed best.
Contributions

We explored the reasons for metadata adoption
119
ACM/IEEE
JCDL 2021
Many efforts have been
made to encourage
metadata adoption by
web page authors.
Once social card
metadata became
available, its use
skyrocketed!
Contributions

We released MementoEmbed and Raintale as reference
implementations for visualizing and distributing stories
120
WADL 2020 WADL 2020
We detailed how to generate
document metadata with
MementoEmbed and visualize and
distribute the story with Raintale.
We also provided an example of
these processes for a day’s news.
Contributions

And I am eager to apply this
expertise at
Los Alamos National Laboratory’s
Information Sciences Division
(CCS-3)
121
https://oduwsdl.github.io/dsa-puddles/shawnmjones/

Using our model and the lessons from these research
questions, we have implemented tools to tell stories that
summarize web archive collections
122
• Generate Story Metadata
• Select Exemplars
• Generate Document Metadata
• Visualize The Story
• Distribute The Story
Read the dissertation for
• use cases
• more example stories
• details on experiments
• details on these tools
• examples with web
archives other than
Archive-It
A sample of future work ideas:
• better summary evaluation
• augmenting collections with live web metadata
• entity/topic cards rather than social cards
• summarizing scholar output, project status, scatter/gather
interfaces
• solving corporate intranet search problems
Contributions:
• 5-process model for automatic storytelling
• vocabulary for types of web archive collections
• structural features of web archive collections
• word count works best for identifying off-topic mementos
• set of primitives for building algorithms
• algorithms built with primitives select novel exemplars that a standard search engine did not discover
• social cards provide better understanding than the existing state-of-the-art web archive surrogates
• machine learning can select the same striking images as a page author
• Hypercane, MementoEmbed, and Raintale as implementations
Conclusion
https://oduwsdl.github.io/dsa/

What story will you tell with web archives?

Backup Slides
124

As collection users, we view Archive-It collections
from outside…
125
• Curators select seeds, which are captured as seed mementos
• Deep mementos are created from other pages linked to seeds

Response times per surrogate had interesting
means, but the differences were not statistically
significant at p < 0.05
126
[Figure: Response Times Per Surrogate. Bar chart of median and mean response times (axis from 0 to 160) for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t. Annotated p-values: p = 0.190 and p = 0.202.]

The Off-Topic Memento Toolkit (OTMT) compares a seed’s first
memento with the seed’s other mementos via different
measures…
Measure | Fully Equivalent Score | Fully Dissimilar Score | Preprocessing Performed | OTMT -tm keyword
Byte Count | 0.0 | -1.0 | No | bytecount
Word Count | 0.0 | -1.0 | Yes | wordcount
Jaccard Distance | 0.0 | 1.0 | Yes | jaccard
Sørensen-Dice | 0.0 | 1.0 | Yes | sorensen
Simhash of Term Frequencies | 0 | 64 | Yes | simhash-tf
Simhash of raw memento | 0 | 64 | No | simhash-raw
Cosine Similarity of TF-IDF Vectors | 1.0 | 0 | Yes | cosine
Cosine Similarity of LSI Vectors | 1.0 | 0 | Yes | gensim_lsi
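As an illustration of two of these measures, the sketch below compares a first memento's text with a later memento's text. It assumes the word count score is the relative change in word count (so identical counts give 0.0 and an empty later memento gives -1.0, matching the table) and uses lowercase whitespace tokenization in place of OTMT's actual preprocessing. The function names are hypothetical and this is not OTMT's code.

```python
def tokenize(text):
    """Simplistic stand-in for preprocessing: lowercase and split on whitespace."""
    return text.lower().split()

def word_count_score(first_text, other_text):
    """Relative change in word count from the first memento to another memento."""
    first = len(tokenize(first_text))
    other = len(tokenize(other_text))
    if first == 0:
        raise ValueError("the first memento has no words to compare against")
    return (other - first) / first

def jaccard_distance(first_text, other_text):
    """1 minus the ratio of shared terms to all terms across both mementos."""
    a, b = set(tokenize(first_text)), set(tokenize(other_text))
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

first = "election results for the county board"
later = "domain for sale"

print(word_count_score(first, first))   # identical text
print(word_count_score(first, later))   # much shorter page
print(jaccard_distance(first, later))   # mostly different vocabulary
```

A tool would then flag a memento as off-topic when its score crosses a chosen threshold; the threshold is a tuning decision not shown here.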
127

Does most of the collection exist earlier or later in its
life?
128
This collection was created in
March 2010.
Most of its mementos come from
2016 – 2018.
Most of this collection exists later
in its life.
Structural feature discussed here:
• area under the seed memento growth curve
• lifespan of the collection
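One way to compute the area under a growth curve is sketched below. It assumes the curve plots the fraction of the collection's lifespan elapsed (x) against the fraction of mementos captured so far (y), that the lifespan runs from the collection's creation to its latest memento, and that a trapezoidal approximation is acceptable. Under those assumptions an area below 0.5 suggests most of the collection exists later in its life. The dates are hypothetical.

```python
from datetime import datetime

def growth_curve(created, datetimes):
    """Return (x, y) points: fraction of lifespan elapsed vs. fraction of items added."""
    ordered = sorted(datetimes)
    lifespan = (ordered[-1] - created).total_seconds()
    if lifespan <= 0:
        raise ValueError("the collection needs a positive lifespan")
    points = [(0.0, 0.0)]
    for count, moment in enumerate(ordered, start=1):
        x = (moment - created).total_seconds() / lifespan
        points.append((x, count / len(ordered)))
    return points

def area_under_curve(points):
    """Trapezoidal approximation of the area under the growth curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

created = datetime(2010, 3, 1)
# Hypothetical memento-datetimes: one collection grows late, one grows early.
late = [datetime(2016, 1, 1), datetime(2017, 1, 1), datetime(2018, 1, 1)]
early = [datetime(2010, 4, 1), datetime(2010, 5, 1), datetime(2018, 1, 1)]

late_area = area_under_curve(growth_curve(created, late))
early_area = area_under_curve(growth_curve(created, early))
print(late_area < 0.5, early_area > 0.5)
```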

When did the curator select and archive a collection’s
contents?
129
This collection was created in
March 2006.
Some of the seeds were
selected in 2006.
Many of the seeds were
selected all along its life.
It has mementos as recent as
July 2018.
Structural feature discussed here:
• area under the seed growth curve

Did the curator create a collection intended to archive new versions of
the same web pages repeatedly?
130
This collection was
created in June 2014.
The seeds were selected
toward the beginning of
its life.
Mementos were
captured all during its life.
Structural features discussed here:
• area under the seed growth curve
• area under the seed memento growth curve

The Memento Protocol provides us a standard
method for acquiring information from web archives
131
Memento gives us TimeGates – identified
by URI-G – for finding a specific memento
based on its original resource and
capture datetime, its memento-datetime.
Memento also gives us TimeMaps – identified by
URI-T – for listing all of the mementos for an original
resource and their memento-datetimes.
<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>; rel="self";
type="application/link-format"
; from="Tue, 20 Jun 2000 18:02:59 GMT"
; until="Wed, 21 Jun 2000 04:41:56 GMT",
<http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>;
rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20091027204954/http://a.example.org>;
rel="last memento";datetime="Tue, 27 Oct 2009 20:49:54 GMT",
<http://arxiv.example.net/web/20000621011731/http://a.example.org>;
rel="memento";datetime="Wed, 21 Jun 2000 01:17:31 GMT",
<http://arxiv.example.net/web/20000621044156/http://a.example.org>;
rel="memento";datetime="Wed, 21 Jun 2000 04:41:56 GMT"
...
Labels in the example: URI-R (the original resource), URI-T (the TimeMap), URI-G (the TimeGate), URI-M (each memento), and memento-datetime (each datetime value).
Van de Sompel, H. Nelson, M. & Sanderson, R. “RFC 7089 – HTTP Framework for Time-Based
Access to Resource States -- Memento”. http://www.rfc-editor.org/info/rfc7089. 2013.
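To show how a client might read a link-format TimeMap like the one above, here is a minimal sketch that uses only regular expressions. It handles the quoted attributes seen in this example and is not a complete link-format parser; the function names are hypothetical.

```python
import re
from email.utils import parsedate_to_datetime

# A link entry: <URI> followed by zero or more ;name="value" attributes.
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')

def parse_timemap(text):
    """Return one dict per link entry with its URI and attributes."""
    entries = []
    for uri, params in LINK_RE.findall(text):
        entry = {"uri": uri}
        entry.update(dict(PARAM_RE.findall(params)))
        entries.append(entry)
    return entries

def list_mementos(entries):
    """Return (memento-datetime, URI-M) pairs sorted by datetime."""
    found = []
    for entry in entries:
        if "memento" in entry.get("rel", "").split():
            found.append((parsedate_to_datetime(entry["datetime"]), entry["uri"]))
    return sorted(found)

SAMPLE = '''<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>; rel="self"; type="application/link-format",
<http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
<http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento";datetime="Tue, 27 Oct 2009 20:49:54 GMT",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT"'''

entries = parse_timemap(SAMPLE)
mementos = list_mementos(entries)
for when, urim in mementos:
    print(when.isoformat(), urim)
```

Splitting the rel value on whitespace matters because values such as "first memento" carry more than one relation type.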

We use surrogates all of the time!
132
Browser Thumbnail (example from UK Web Archive)
Text snippet (example from Bing)
Social Card (example from Facebook)
Text + Thumbnail (example from Internet Archive)
S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.”
https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018.

Surrogates are not new!
Traditional surrogates contain metadata
generated by humans to convey aboutness
133
An individual surrogate
summarizes an item.
Card catalogs, however, were not stories, just manual methods for
finding individual items in collections.

Surrogates provide a visual summary of the
content behind a URI…
134
Long URI:
https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582
The same URI as a browser
thumbnail surrogate:
The same URI as a social card
surrogate:

Social media storytelling uses surrogates to
provide a “summary of summaries”
135
2 resources are shown from this Wakelet story
6 resources are shown from this Storify story
Each surrogate summarizes
a web resource.
Each story groups the
surrogates, summarizing the
topic.
We want to use this
technique to summarize
web archive collections
because users are already
familiar with this visualization
paradigm.

The Problem: Understanding
web archive collections is
costly
136
• There are multiple collections about the “same concept.”
• The metadata for each collection is non-existent, or inconsistently applied.
• A seed is a web page to be crawled.
• A memento is an observation of a seed at a specific point in time.
1000s of seeds with multiple mementos.
• There are more than 14,000 collections.
• Archive-It is a popular platform, but other web archive collection platforms exist (e.g., Library of Congress, Conifer, Trove).
• Existing solutions do not handle the time dimension inherent to web archive collections.
more seeds = less metadata

Our Solution: Social media storytelling uses groups
of surrogates to provide a “summary of summaries”
Each surrogate summarizes a web resource.
Each story groups the surrogates, summarizing the topic.
We want to use this technique to summarize web archive collections
because users are already familiar with this visualization paradigm.
We established a
five-process model
for storytelling with
web archive
collections
A surrogate
summarizes a
web page.
This surrogate
type is called a
social card.
Storytelling is the visualization. Our
contribution is the automation that
selects the exemplars and
metadata that make this story.

The problem, summarized
• There are multiple collections about the same concept.
• The metadata for each collection is non-existent, or inconsistently applied.
1000s of seeds with multiple mementos.
• There are more than 14,000 collections.
• Human review of these mementos for collection understanding is an expensive proposition.
138

Archive-It allows easy collection creation
Archive-It was created by the Internet Archive as a consistent user interface for constructing
web archive collections. Curators can supply live web resources as seeds and establish crawling
schedules of those seeds to create mementos.
139

Reviewing mementos manually is costly
This collection has 132,599 seeds, many
with multiple mementos
Some collections have 1000s of
seeds
Each seed can have many
mementos
In some cases, this can require
reviewing 100,000+ documents to
understand the collection
140

More Archive-It collections are added every
year
More than 14,000 collections exist as of the end of 2020
141
[Figure: # of New Archive-It Collections Per Year. x-axis: Year (2005 to 2020); y-axis: # of Collections (0 to 2500); series: All Collections, Only Private Collections, Only Public Collections.]

Latent Semantic Analysis for document
clustering
142
LSA utilizes a term-document matrix
• rows correspond to terms and columns correspond to documents
• elements are typically weighted via TF-IDF
• with TF-IDF, an element's weight is proportional to the number of times the term appears in the document
• use singular value decomposition (SVD) to factor it into new matrices
• the last of these matrices contains a set of documents with coordinates for each cluster
LSA requires that the user supply the desired number of topics. Dark cells indicate high weights.
High weights signify clustering.
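The first step of LSA, building the TF-IDF weighted term-document matrix, can be sketched in plain Python as follows. This assumes a simple weighting of raw term count times log(N / document frequency); other TF-IDF variants exist. The decomposition step is omitted because it would normally rely on a linear algebra library. The documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_term_document_matrix(documents):
    """Return (terms, matrix) where rows are terms and columns are documents."""
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    counts = [Counter(doc) for doc in tokenized]
    # Document frequency: the number of documents containing each term.
    doc_freq = {term: sum(1 for doc in tokenized if term in doc) for term in terms}
    matrix = [
        [counts[j][term] * math.log(n_docs / doc_freq[term]) for j in range(n_docs)]
        for term in terms
    ]
    return terms, matrix

docs = ["flood flood river", "river election", "election vote"]
terms, matrix = tfidf_term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(term, [round(weight, 3) for weight in row])
```

Each row of the output is one term and each column one document, which is the layout the bullets above describe.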
Wikipedia contributors. (2019, July 26). Latent semantic analysis. In Wikipedia, The Free Encyclopedia.
Retrieved 21:31, July 31, 2019,
from https://en.wikipedia.org/w/index.php?title=Latent_semantic_analysis&oldid=907976703
It will be difficult to generalize this number across types of collections.

Latent Dirichlet Allocation
For a corpus $D$ consisting of $M$ documents, each of length $N_i$:
1. Choose $\theta_i \sim \mathrm{Dir}(\alpha)$, where $i \in \{1, \dots, M\}$ and $\mathrm{Dir}(\alpha)$ is a Dirichlet distribution with symmetric parameter $\alpha$, which typically is sparse ($\alpha < 1$)
2. Choose $\varphi_k \sim \mathrm{Dir}(\beta)$, where $k \in \{1, \dots, K\}$ and $\beta$ typically is sparse
3. For each of the word positions $i, j$, where $i \in \{1, \dots, M\}$ and $j \in \{1, \dots, N_i\}$:
1. Choose a topic $z_{i,j} \sim \mathrm{Multinomial}(\theta_i)$
2. Choose a word $w_{i,j} \sim \mathrm{Multinomial}(\varphi_{z_{i,j}})$
$K$ is the number of topics requested by the user
$M$ is the number of documents in the corpus
$N_i$ is the number of words in document $i$
$\varphi_k$ is the word distribution for topic $k$
$\theta_i$ is the topic distribution for document $i$
$z_{i,j}$ is the topic for the $j$-th word in document $i$
$w_{i,j}$ is a specific word in document $i$
* An example of a multinomial: the probability of counts of each side when rolling a k-sided die n times
It will be difficult to generalize this number across types of collections.
Wikipedia contributors. (2019, July 25). Latent Dirichlet allocation. In Wikipedia, The Free Encyclopedia.
Retrieved 20:13, July 31, 2019,
from https://en.wikipedia.org/w/index.php?title=Latent_Dirichlet_allocation&oldid=907806560
143

Many have tackled selecting exemplar sentences or
images from a document, but few have covered selecting
exemplar documents from a corpus over time.
144
We are
inspired by
these
solutions and
will apply
some of their
ideas in a
moment.
Silva et al. word graphs
Sipos et al. influential author clusters
Silva and Sampaio. 2014. Using Luhn’s Automatic Abstract Method to Create Graphs of
Words for Document Visualization. Social Networking. 65-70.
https://doi.org/10.4236/sn.2014.32008.
R. Sipos et al. 2012. Temporal corpus summarization using submodular word coverage.
In ACM CIKM 2012, 754-763. https://doi.org/10.1145/2396761.2396857.

Existing tools for web archive collections require that the
user have access to WARCs.
145
ArchiveSpark
Archives Unleashed Cloud (now part of Archive-It)
Warclight
Archivists are the only ones likely to have that access.
We want anyone to be able to summarize a collection.
Holzmann et al. 2016. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. In ACM/IEEE JCDL
2016, 83-92. https://doi.org/10.1145/2910896.2910902.
Ruest et al. 2014. archivesunleashed/warclight – A Rails engine supporting the discovery of web archives.
https://github.com/archivesunleashed/warclight.
Deschamps et al. 2019. The Cost of a WARC: Analyzing Web Archives in the Cloud. In
ACM/IEEE JCDL 2019, 261-264. https://doi.org/10.1109/JCDL.2019.00043.
Stories also need URIs for
linking surrogates. WARCs
alone cannot do this.

Existing work on generating story metadata relies
on archivists to manually review and annotate
each seed or memento
146
Scale is the greatest
challenge here. Web
archive collections
grow quickly, and
archivists have a
hard time keeping up
with the number of
documents to
annotate.
D. V. Pitti, “Encoded Archival Description,” D-Lib Magazine, vol. 5, no. 11, 1999.
https://doi.org/10.1045/november99-pitti.
Encoded Archival Description could
work, if there were not thousands of
documents to annotate.

Other studies on surrogates did not focus on whether participants
understood the underlying collection, but instead on whether
participants chose the correct search result for a query
147
These studies did not compare
thumbnails to social cards
directly.
Web archives love using
thumbnails, but is there
something better for visitors?

Others tried to visualize whole collections at once or
created solutions specific to a web archive
148
Conta Me Histórias
Padia et al.
R. Campos et al. 2021. Automatic generation of timelines for past-web events. The Past
Web: Exploring Web Archives, 225-242. https://doi.org/10.1007/978-3-030-63291-5_18.
K. Padia, Y. AlNoamany, and M. C. Weigle, “Visualizing digital collections at Archive-It,”
in Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries,
(Washington, DC, USA), pp. 15–18, 2012. https://doi.org/10.1145/2232817.2232821.

DSA2 Algorithm
151

DSA3 Algorithm
152

DSA4 Algorithm
153
