Improving Collection Understanding
For Web Archives With Storytelling:
Shining Light Into
Dark and Stormy Archives
Shawn M. Jones
Los Alamos National Laboratory
Research Library Prototyping Team
Web Science and Digital Libraries Research Group
Old Dominion University
Dissertation Defense 2021/08/05
1
Thanks to:
@shawnmjones @WebSciDL
Outline
1. Motivation And Research
Questions
2. Background And Related Work
3. Selecting Exemplars And
Generating Story Metadata
4. Generating Document Metadata
5. Visualizing And Distributing Stories
6. Contributions And Conclusion
2
@shawnmjones @StormyArchives
November 8, 2019
3
During the second week of November 2019,
the National Center for Medical Intelligence shared
intelligence based on "monitoring of internal Chinese
communications" that warned of a potential novel
coronavirus pandemic coming out of Wuhan.
Source: https://en.wikipedia.org/wiki/Timeline_of_the_COVID-19_pandemic_in_2019
COVID-19 was not named and was only
known to a small group in the US.
No news coverage existed.
@shawnmjones @StormyArchives
December 16, 2019
4
The first documented COVID-19 hospital
admission was on December 16, 2019.
COVID-19 was still not well known and
received no news coverage.
@shawnmjones @StormyArchives
January 13, 2020
5
One month later, CNN carries a coronavirus
category on its front page.
@shawnmjones @StormyArchives
February 28, 2020
6
Another month goes by with more front-page
articles about coronavirus.
@shawnmjones @StormyArchives
March 13, 2020
7
Two weeks later, CNN has many front-page
articles about coronavirus, with a special
Coronavirus heading for more articles.
@shawnmjones @StormyArchives
March 20, 2020
8
A week later, states are locking down.
@shawnmjones @StormyArchives
March 27, 2020
9
A week later, the US has the most cases of any
country.
@shawnmjones @StormyArchives
A web archive
helped me tell
this story.
10
These mementos are stored
in the Internet Archive.
They are full captures of the
web code that existed on
those dates.
@shawnmjones @StormyArchives
What other stories can we tell with web
archives?
11
Motivation and Research Questions
@shawnmjones @StormyArchives
Natasha is studying how disasters shape
cultures...
12
Sources like Wikipedia now have a
summary of the event after the
fact.
Today she is
reviewing the
South Louisiana
Flood of 2016.
Motivation and Research Questions
She wants to know about
the news reporting as it was
at the time of the event.
@shawnmjones @StormyArchives
Per Nwala et al., news articles about the event tend to slide
down search results as we get further from the event.
13
Motivation and Research Questions
Green = coverage of event
Red = Summaries of the event
A. C. Nwala, M. C. Weigle, and M. L. Nelson, “Scraping SERPs for Archival Seeds: It Matters
When You Start,” in ACM/IEEE JCDL, 2018. https://doi.org/10.1145/3197026.3197056.
She knows that
five years later,
it is harder to
find news
articles from
the event itself.
@shawnmjones @StormyArchives
Natasha also knows that news articles are updated with
more current and correct information
14
She wants to
know about
the news
reporting as it
was at the time
of the event.
Motivation and Research Questions
Today
8/14/2016
during event
@shawnmjones @StormyArchives
Natasha knows that any time that we need proof
that X said Y at date D, we need web archives
15
She knows that
web archives
contain not just
“screenshots”
but full
captures of
web code as
mementos.
To start, she must know a
URL and capture
datetime.
Then she can view a
memento.
And she can review its
code, if needed.
Motivation and Research Questions
@shawnmjones @StormyArchives
Natasha also knows that archivists create
web archive collections based on a theme
16
Motivation and Research Questions
@shawnmjones @StormyArchives
With these themed collections, she can discover documents
that once existed and match her event or topic
17
Virginia Tech: Crisis, Tragedy, and
Recovery Network capturing
coverage of the 2011 Tucson Shootings
University of Utah capturing its
web presence over time
Motivation and Research Questions
@shawnmjones @StormyArchives
Natasha has discovered multiple sites with
themed web archive collections
18
Library of Congress
Archive-It
(by the Internet Archive)
Trove
Conifer
Each site has
different
capabilities and
different types of
collections.
Motivation and Research Questions
@shawnmjones @StormyArchives
Natasha chooses to look through the
themed collections at Archive-It
19
As a popular subscription service of the Internet
Archive, Archive-It helps archivists create themed
collections.
These collections consist of seeds.
Mementos are observations of a seed at different
points in time.
For each seed, there are multiple mementos.
This seed has 7 mementos (captured 7 times).
Motivation and Research Questions
@shawnmjones @StormyArchives
There are multiple collections about the
subject. Which one should she work with?
20
This is not the only
disaster she is studying.
She needs to waste as
little time as possible.
Motivation and Research Questions
@shawnmjones @StormyArchives 21
Natasha is not alone: 44
Archive-It collections
match the search query
“human rights”
How are they different
from each other?
Which one is best for
her needs?
Motivation and Research Questions
@shawnmjones @StormyArchives
Rustam needs to study how the Boston
Marathon Bombing unfolded…
22
Reviewing different
mementos of the
same seed allows
Rustam to
understand when
the public learned of
different events,
including when
misinformation was
corrected.
Rather than digging through collections manually, how can Rustam discover and view this more quickly?
Motivation and Research Questions
@shawnmjones @StormyArchives
Olayinka wants to understand what different
news sources revealed on the same day…
23
Today she is
trying to
understand the
different
reporting on the
September 11th
Attacks.
How can Olayinka discover and view this more quickly?
Motivation and Research Questions
@shawnmjones @StormyArchives
Elbert is an archivist who wants to promote his
collections, so others are aware of them…
24
He wants to help
visitors like
Natasha, Rustam,
and Olayinka
notice his
collections and use
them.
How does he create enticing visualizations that people can understand with minimal effort?
Motivation and Research Questions
@shawnmjones @StormyArchives
Ling is an archivist who inherited a collection from another
archivist, and she needs to understand it so she can make
decisions about it…
25
Her collection has
hundreds of
thousands of seeds.
Her predecessor did
not provide much
metadata with the
collection.
Archivists can add metadata to
collections, but many Archive-It
collections contain little metadata.
The more metadata a reader needs to
understand a collection, the less they
have available.
Motivation and Research Questions
@shawnmjones @StormyArchives
Ling knows she is not alone – the collections are often built
automatically, making it difficult to know what they contain
26
Web Archiving Technical Lead of the British Library
Ling knows that the
automation makes it
expensive to add
metadata to
thousands of
documents after
they are collected.
Motivation and Research Questions
@shawnmjones @StormyArchives
All these personas need a faster method of
collection understanding
27
Persona | Role | Information need | Understanding needs
Natasha | Visitor | Quickly compare collections | Overall collection
Rustam | Visitor | Follow a source over time | Aspect (Page) of a collection
Olayinka | Visitor | Understand a time from different sources | Aspect (Time) of a collection
Elbert | Archivist | Promote collections and help visitors understand them | Overall collection
Ling | Archivist | Understand a collection that they inherited | Overall collection
Motivation and Research Questions
@shawnmjones @StormyArchives
All are faced with more than 14,000
collections at Archive-It alone
28
More than 14,000 collections exist as of the end of 2020
[Chart: # of New Archive-It Collections Per Year, 2005 to 2020. x-axis: Year; y-axis: # of Collections (0 to 2500); series: All Collections, Only Private Collections, Only Public Collections.]
Motivation and Research Questions
@shawnmjones @StormyArchives
The problem, summarized
29
§ There are multiple collections
about the same concept.
§ It is difficult to easily expose
aspects (e.g., time, page) of
collections.
§ The metadata for each collection
is non-existent, or inconsistently
applied.
§ Many collections have
1000s of seeds with multiple
mementos.
§ There are more than 14,000
collections.
§ Human review of these mementos
for collection understanding is an
expensive proposition.
Motivation and Research Questions
@shawnmjones @StormyArchives
Our proposal: a visualization made of
exemplar mementos
30
§ Our visualization is a summary
that will act like an abstract
§ Pirolli and Card’s Information
Foraging Theory:
§ maximize the value of the
information gained from our
summaries
§ minimize the cost of interacting
with the collection
§ ensure that our exemplar
mementos have good
information scent
§ contain cues that the memento
will address a user’s needs
From this:
318 seeds with
2421 mementos
To something like this:
a social media story
of ~28 surrogates
P. Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive
Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
Motivation and Research Questions
@shawnmjones @StormyArchives
Users already interact with pages like this
every day
31
A story on Wakelet about the 2021
Capitol Attack
Motivation and Research Questions
A Twitter Moment of
astronaut Michael Collins
Twitter creates
Moments that
present surrogates
linking to content
about a topic of
interest.
Educators, librarians,
and others create
stories on Wakelet
about different
subjects.
@shawnmjones @StormyArchives
Social media stories apply visualizations that
users already know how to understand
32
An individual surrogate summarizes a web resource.
When we combine surrogates into a story, we
summarize a topic.
Motivation and Research Questions
@shawnmjones @StormyArchives
We developed a five-process storytelling model based
on existing work on summarization and storytelling
33
exemplar
mementos
collection title: 2013 Boston
Marathon Bombing
collected by: Internet
Archive Global Events
collection URL
image data...
seed data...
top terms
top entities...
title: Boston Marathon
Explosions...
description: “The
grace this tragedy
exposed...”
striking image..
Select Exemplars → Generate Story Metadata → Generate Document Metadata → Visualize The Story → Distribute The Story
AlNoamany found
that popular stories
contain 28 elements,
so we have a target
of 28 exemplars.
AlNoamany
pioneered this work
combining web
archive collections
with Storify, but Storify
is now gone.
Motivation and Research Questions
@shawnmjones @StormyArchives
Our five-process storytelling model maps to
our research questions
34
RQ1: What types of web archive
collections exist and what are
their structural features?
RQ2: What approaches work
best for selecting exemplars
from web archive collections?
RQ3: What surrogates work best
for understanding groups of
mementos?
RQ4: What methods that
automate the creation of
surrogates produce results that
best match humans’ behavior?
Select Exemplars → Generate Story Metadata → Generate Document Metadata → Visualize The Story → Distribute The Story
Examples and Use Cases for our
Personas
Motivation and Research Questions
@shawnmjones @StormyArchives
Our Dark and Stormy Archives Tools serve as a
reference implementation of our storytelling process
35
Motivation and Research Questions
@shawnmjones @WebSciDL
Outline
1. Motivation And Research
Questions
2. Background And Related Work
3. Selecting Exemplars And
Generating Story Metadata
4. Generating Document Metadata
5. Visualizing And Distributing Stories
6. Contributions And Conclusion
36
@shawnmjones @StormyArchives
URIs identify resources
37
T. Berners-Lee, et al. “RFC 3986 – Uniform Resource Identifier (URI): Generic Syntax”.
https://www.rfc-editor.org/rfc/rfc3986.txt, 2005.
Jacobs, I. and Walsh, N. eds., “Architecture of the World Wide Web, Vol. 1.”
https://www.w3.org/TR/webarch/, 2003.
URIs are a superset of identifiers that
contains URLs, URNs, etc.
Background and Related Work
URIs identify resources, which have
different representations
depending on the visitor’s needs.
@shawnmjones @StormyArchives
HTML is the file format we use for web
resources
38
HTML contains links to other
pages, identified by URIs.
Background and Related Work
@shawnmjones @StormyArchives
Web archives apply crawlers to quickly visit
pages and follow links to build collections
39
Background and Related Work
Crawlers save web resources in the
WARC file format.
WARC/1.0
WARC-Date: 2016-04-01T18:08:53Z
WARC-Type: response
WARC-Record-ID: <urn:uuid:8ede0256-790b-4378-a469-cfdcf0b9f8a6>
WARC-Target-URI: http://philadelphia.feb.gov/
WARC-Payload-Digest: sha1:USWZWZNSVCOSXA2MBSHMRAVZY7R2ZQSY
WARC-Block-Digest: sha1:NNBOIVAWIP63IESE3JCE36X6BRE2YYZG
Content-Type: application/http; msgtype=response
Content-Length: 19556
HTTP/1.0 200 OK
server: Apache-Coyote/1.1
content-type: text/html;charset=utf-8
content-length: 35917
date: Sat, 08 May 2021 16:04:56 GMT
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
The page to be crawled is a seed or original resource.
An observation of that original resource at a specific time is a
memento.
We use the term URI-M to denote a memento URI.
The datetime of a memento’s capture is its memento-datetime.
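As a rough illustration of how these terms map onto a WARC record, the sketch below reads the WARC header fields of a response record using only the Python standard library. The helper name parse_warc_headers and the abbreviated record are assumptions made for this example; real pipelines would more likely rely on a dedicated WARC library, and treating WARC-Date as the memento-datetime is a common convention rather than something the slide states.

```python
def parse_warc_headers(record: str) -> dict:
    """Return the named fields from the first (WARC) header block of a record."""
    # Real WARC files use CRLF line endings; normalize so the split below works.
    record = record.replace("\r\n", "\n")
    header_block = record.split("\n\n", 1)[0]
    lines = header_block.splitlines()
    headers = {"version": lines[0].strip()}
    for line in lines[1:]:
        # Split at the first colon only, so URI values keep their own colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


# Abbreviated copy of the record shown above (digests and payload omitted).
RECORD = (
    "WARC/1.0\n"
    "WARC-Date: 2016-04-01T18:08:53Z\n"
    "WARC-Type: response\n"
    "WARC-Target-URI: http://philadelphia.feb.gov/\n"
    "Content-Type: application/http; msgtype=response\n"
    "\n"
    "HTTP/1.0 200 OK\n"
)

headers = parse_warc_headers(RECORD)
original_resource = headers["WARC-Target-URI"]  # the seed / original resource
memento_datetime = headers["WARC-Date"]         # when this observation was captured
```
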
@shawnmjones @StormyArchives
A TimeMap gives us a listing of the
mementos available for an original resource
43
Background and Related Work
the original resource
“now”
<http://www.cs.odu.edu>;rel="original",
<https://web.archive.org/web/19970102130137/http://cs.odu.edu:80/>;rel="memento"; datetime="Thu, 02 Jan 1997 13:01:37 GMT",
<https://web.archive.org/web/19970606105039/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Fri, 06 Jun 1997 10:50:39 GMT",
<http://archive.md/19970606105039/http://www.cs.odu.edu/>;rel="memento"; datetime="Fri, 06 Jun 1997 10:50:39 GMT",
<https://web.archive.org/web/19971010201632/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Fri, 10 Oct 1997 20:16:32 GMT",
<https://web.archive.org/web/19971211124211/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Thu, 11 Dec 1997 12:42:11 GMT",
...
<https://web.archive.org/web/19990502033600/http://cs.odu.edu:80/>;rel="memento"; datetime="Sun, 02 May 1999 03:36:00 GMT",
...
<https://arquivo.pt/wayback/20091223043049mp_/http://www.cs.odu.edu/>;rel="memento"; datetime="Wed, 23 Dec 2009 04:30:49 GMT",
...
memento from 1997
memento from 1999 memento from 2009
Van de Sompel, H. Nelson, M. & Sanderson, R. “RFC 7089 – HTTP Framework for Time-Based
Access to Resource States -- Memento”. http://www.rfc-editor.org/info/rfc7089. 2013.
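To make the TimeMap listing concrete, here is a minimal sketch that pulls the original resource and the memento entries out of link-format text like the excerpt above. The function name parse_timemap and the regular expression are assumptions for illustration; the pattern only handles the attribute order shown on this slide (rel followed by an optional datetime) and is not a full link-format parser.

```python
import re
from datetime import datetime, timezone

# Matches: <URI>;rel="...", optionally followed by ; datetime="..."
LINK_PATTERN = re.compile(
    r'<([^>]+)>\s*;\s*rel="([^"]+)"(?:\s*;\s*datetime="([^"]+)")?'
)


def parse_timemap(text):
    """Return (original_uri, [(memento_datetime, uri_m), ...]) sorted by time."""
    original = None
    mementos = []
    for uri, rel, dt in LINK_PATTERN.findall(text):
        rels = rel.split()
        if "original" in rels:
            original = uri
        elif "memento" in rels and dt:
            when = datetime.strptime(dt, "%a, %d %b %Y %H:%M:%S GMT")
            mementos.append((when.replace(tzinfo=timezone.utc), uri))
    return original, sorted(mementos)


# Two entries copied from the TimeMap excerpt above.
TIMEMAP = (
    '<http://www.cs.odu.edu>;rel="original",\n'
    '<https://web.archive.org/web/19970102130137/http://cs.odu.edu:80/>'
    ';rel="memento"; datetime="Thu, 02 Jan 1997 13:01:37 GMT",\n'
    '<https://arquivo.pt/wayback/20091223043049mp_/http://www.cs.odu.edu/>'
    ';rel="memento"; datetime="Wed, 23 Dec 2009 04:30:49 GMT",\n'
)

original, mementos = parse_timemap(TIMEMAP)
first_datetime, first_urim = mementos[0]  # earliest memento-datetime and its URI-M
```
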
@shawnmjones @StormyArchives
Others have tackled portions of the problem of summarizing
web archives, but only AlNoamany addressed all processes
44
Background and Related Work
Some have conflated our
steps of generating
metadata and visualizing it.
Many have focused, and continue to
focus, on selecting exemplar
words, sentences, images,
video clips, and more for
summarization.
Those who have evaluated
surrogates in the past focused
on whether the participant chose the
correct search engine result,
rather than on understanding.
Attempts to manually apply metadata
to these collections are impacted by
the scale of the problem.
@shawnmjones @StormyArchives
AlNoamany identified the characteristics of social
media stories and Archive-It collections
45
Background and Related Work
Select Exemplars → Generate Story Metadata → Generate Document Metadata → Visualize The Story → Distribute The Story
By analyzing the characteristics
of stories and collections, she
determined that popular
stories contain 28 elements.
Our model maps to hers but
expands her visualize step.
AlNoamany’s
sieve diagram
gives us one
solution for
storytelling. We will
explore others.
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From
Archived Collections,” in ACM Web Science, pp. 309–318, 2017.
https://doi.org/10.1145/3091478.3091508.
@shawnmjones @StormyArchives
Select
Exemplars
AlNoamany extracted some story metadata and relied on
Storify to create and distribute the resulting visualization.
46
Background and Related Work
Generate
Story
Metadata
Generate
Document
Metadata
Visualize
The
Story
Distribute
The
Story
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From
Archived Collections,” in ACM Web Science, pp. 309–318, 2017.
https://doi.org/10.1145/3091478.3091508.
@shawnmjones @StormyArchives
Her proof-of-concept generated some document
metadata and relied on Storify to generate the rest.
47
Background and Related Work
Generate
Story
Metadata
Generate
Document
Metadata
Select
Exemplars
Visualize
The
Story
Distribute
The
Story
Storify
AlNoamany’s
Proof-of-Concept
(POC)
Both POC and
Storify Generated
Portions of
Document
Metadata
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From
Archived Collections,” in ACM Web Science, pp. 309–318, 2017.
https://doi.org/10.1145/3091478.3091508.
@shawnmjones @StormyArchives
She generated many different stories based on
exemplars selected by her proof-of-concept
48
Generate
Story
Metadata
Generate
Document
Metadata
Select
Exemplars
Visualize
The
Story
Distribute
The
Story
Storify
AlNoamany’s
Proof-of-Concept
(POC)
Both POC and
Storify Generated
Portions of
Document
Metadata
Background and Related Work
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From
Archived Collections,” in ACM Web Science, pp. 309–318, 2017.
https://doi.org/10.1145/3091478.3091508.
@shawnmjones @StormyArchives
Through a user study, she demonstrated that participants
could tell the difference between her solution’s stories and
randomly generated stories
49
Background and Related Work
Participants could not tell the
difference between her
solution’s stories and those
generated by human archivists
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From
Archived Collections,” in ACM Web Science, pp. 309–318, 2017.
https://doi.org/10.1145/3091478.3091508.
@shawnmjones @StormyArchives
Unfortunately, her solution is difficult to generalize
50
Generate
Story
Metadata
Generate
Document
Metadata
Select
Exemplars
Visualize
The
Story
Distribute
The
Story
Storify
AlNoamany’s
Proof-of-Concept
(POC)
Both POC and
Storify Generated
Portions of
Document
Metadata
Background and Related Work
Adobe shut down
the Storify platform
in 2018.
AlNoamany’s POC
focused on Archive-It.
@shawnmjones @WebSciDL
Outline
1. Motivation And Research
Questions
2. Background And Related Work
3. Selecting Exemplars And
Generating Story Metadata
4. Generating Document Metadata
5. Visualizing And Distributing Stories
6. Contributions And Conclusion
51
@shawnmjones @StormyArchives
As collection users, what structural features
can we view from outside?
52
§ Using only structural features is
advantageous because it saves
one from having to download a
collection’s content.
§ These structural features give us
different insight than can be
provided by text analysis or
metadata.
81,014 seeds
486,227 seed mementos
Structural features shown here:
• number of seeds
• number of mementos
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
Was the collection built from web sites belonging
to one domain or many?
53
Many domains One domain
Structural feature
discussed here:
• domain diversity
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
Were most of the web pages in the collection top-level
pages or specific articles deeper in a web site?
54
Top-level pages Deeper links
Structural feature
discussed here:
• path depth diversity
• most frequent path
depth
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
Growth curves provide some understanding of
collection curation behavior
55
• Skew of the
collection’s holdings
• Indicates
temporality of
collection
• Skew of the curatorial
involvement with the
collection
• When seeds were
added
• When interest was lost
or regained
(Positive) (Positive)
(Negative)
(Negative)
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
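One possible way to compute the points of such a growth curve is sketched below. It assumes the input is a list of (seed URI, capture datetime) pairs and reports, at each capture, the fraction of the collection's lifespan elapsed, the fraction of seeds seen so far, and the fraction of mementos captured so far. The function name growth_curve, the input shape, and the example.com and example.org URIs are hypothetical, and the cited paper may define its curves differently.

```python
from datetime import datetime


def growth_curve(captures):
    """captures: list of (seed_uri, capture_datetime) pairs.

    Returns a list of tuples:
    (fraction_of_lifespan, fraction_of_seeds_seen, fraction_of_mementos_seen)
    """
    ordered = sorted(captures, key=lambda capture: capture[1])
    start, end = ordered[0][1], ordered[-1][1]
    # Guard against a zero-length lifespan (all captures at the same instant).
    lifespan = (end - start).total_seconds() or 1.0
    total_seeds = len({seed for seed, _ in ordered})
    seen = set()
    curve = []
    for count, (seed, when) in enumerate(ordered, start=1):
        seen.add(seed)
        curve.append((
            (when - start).total_seconds() / lifespan,
            len(seen) / total_seeds,
            count / len(ordered),
        ))
    return curve


# Hypothetical collection: two seeds, each captured twice over 30 days.
captures = [
    ("http://example.com/", datetime(2016, 1, 1)),
    ("http://example.com/", datetime(2016, 1, 11)),
    ("http://example.org/", datetime(2016, 1, 21)),
    ("http://example.org/", datetime(2016, 1, 31)),
]
curve = growth_curve(captures)
```

A seed curve that rises early suggests most seeds were added near the start of the collection's life; one that rises late suggests later curatorial activity.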
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
We discovered four semantic categories in
Archive-It collections
56
Self-Archiving
54.1% of collections
Subject-based
27.6% of collections
Time Bounded – Expected
14.1% of collections
Time Bounded – Spontaneous
4.2% of collections
Some evaluated by AlNoamany
In a study of 3,382 Archive-It collections
Selecting Exemplars and Generating Story Metadata
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
@shawnmjones @StormyArchives
Self-Archiving collections dominate Archive-It
57
54.1% of collections
27.6% 14.1%
In a study of 3,382 Archive-It collections
Selecting Exemplars and Generating Story Metadata
Subject-based Time Bounded
– Expected
Time Bounded
– Spontaneous
4.2%
Organizations
archiving themselves
or those they are
responsible for.
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
@shawnmjones @StormyArchives
Subject-based collections come in second
58
27.6% of collections
14.1%
In a study of 3,382 Archive-It collections
Selecting Exemplars and Generating Story Metadata
Time Bounded
– Expected
Time Bounded
– Spontaneous
4.2%
Collections centered
on a subject that is
not ephemeral.
54.1%
Self-archiving
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
@shawnmjones @StormyArchives
Time Bounded – Expected
collections summarize events
we anticipate
59
14.1% of collections
In a study of 3,382
Archive-It collections
Selecting Exemplars and Generating Story Metadata
Time Bounded
– Spontaneous
4.2%
Collections about an
anticipated event.
54.1%
Self-archiving
27.6%
Subject-based
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
@shawnmjones @StormyArchives 60
4.2% of collections
In a study of 3,382
Archive-It collections
Selecting Exemplars and Generating Story Metadata
Collections about an
unexpected event.
Some of these were
evaluated by AlNoamany.
54.1%
Self-archiving
27.6%
Subject-based
14.1%
Time Bounded
– Expected
Time Bounded – Spontaneous
collections summarize
unexpected events
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
@shawnmjones @StormyArchives
We can bridge the structural to the
descriptive…
61
Self-Archiving
54.1% of collections
Subject-based
27.6% of collections
Time Bounded – Expected
14.1% of collections
Time Bounded – Spontaneous
4.2% of collections
Some evaluated by AlNoamany
Using the structural features mentioned previously, we can predict these
semantic categories with a Random Forest classifier with F1 = 0.720
Selecting Exemplars and Generating Story Metadata
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
@shawnmjones @StormyArchives
RQ1: What types of web archive collections exist
and what structural features do they have?
62
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Type | % of Archive-It Collections | Description | Example Collection
Self-Archiving | 54.1% | an organization archiving itself | University of Utah Web Archive
Subject-based | 27.6% | seeds bound by single topic | Environmental Justice
Time Bounded – Expected | 14.1% | an expected event or time period | 2008 Olympics
Time Bounded – Spontaneous | 4.2% | unexpected event | Tucson Shootings
Based on a manual review of 3,382 Archive-It
collections, we classified them into 4 types.
Growth curves give us some idea
of the curatorial involvement with
a collection over time.
When selecting exemplars, we
need to summarize the collection
in terms of time and topic. The
shapes of these growth curves
indicate how we might cluster in
time.
This example growth curve
shows us that 30% of the
seeds were added early in
the collection’s life.
Structurally, for seeds, we can study the:
• distribution of domains
• distribution of path depths
• most frequent path depth
• query string usage
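The seed features listed above can be approximated from the seed URIs alone, without downloading any content. The sketch below is a simplified illustration: seed_structure is a hypothetical helper, hostnames stand in for domains, path depth is counted as the number of non-empty path segments, and the example.com and example.org seeds are made up. The published study may measure diversity differently.

```python
from collections import Counter
from urllib.parse import urlsplit


def seed_structure(seed_uris):
    """Summarize simple structural features of a list of seed URIs."""
    domains = Counter()
    depths = Counter()
    with_query = 0
    for uri in seed_uris:
        parts = urlsplit(uri)
        domains[parts.hostname or ""] += 1
        # Path depth: number of non-empty segments, so "/" has depth 0.
        depth = len([segment for segment in parts.path.split("/") if segment])
        depths[depth] += 1
        if parts.query:
            with_query += 1
    return {
        "domains": domains,
        "path_depths": depths,
        "most_frequent_path_depth": depths.most_common(1)[0][0],
        "query_string_fraction": with_query / len(seed_uris),
    }


# Hypothetical seeds for illustration only.
seeds = [
    "http://example.com/",
    "http://example.com/news/2016/flood.html",
    "http://example.org/news/2016/storm.html",
    "http://example.org/search?q=flood",
]
features = seed_structure(seeds)
```
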
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
Identifying off-topic mementos is key to choosing
exemplar mementos
63
Hacked
Moved on from topic
Collections have a topic.
Seeds are selected to
support that topic.
Mementos are
observations of seeds.
Some of these versions are
off-topic.
Excluding these off-topic
mementos from
consideration is key to
selecting exemplars.
Web Page Gone
Account Suspension
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
We found that Word Count had the best F1
score for identifying off-topic mementos
64
We reused AlNoamany’s
labeled dataset.
She did not try:
• Sorensen-Dice
• Simhash of raw
content
• Simhash of TF
• Gensim LSI
Our word count
accuracy came out
ahead of AlNoamany’s.
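A word count measure can be sketched as the relative change in word count between a baseline memento and a later memento of the same seed. The sketch below assumes the first memento in the TimeMap is on-topic and uses an arbitrary threshold of -0.5 purely for illustration; the function names, the threshold, and the sample texts are hypothetical and are not the values used in the cited studies.

```python
def word_count_change(first_text, later_text):
    """Relative change in word count of later_text versus first_text."""
    first_count = len(first_text.split())
    later_count = len(later_text.split())
    if first_count == 0:
        return 0.0  # no baseline to compare against
    return (later_count - first_count) / first_count


def is_off_topic(first_text, later_text, threshold=-0.5):
    """Flag a memento whose word count dropped past the (assumed) threshold."""
    return word_count_change(first_text, later_text) < threshold


# Hypothetical memento texts for one seed.
first = "flood relief shelters open across south louisiana parishes today"
update = "flood relief shelters remain open across south louisiana today"
suspended = "account suspended"

flags = [is_off_topic(first, update), is_off_topic(first, suspended)]
```

The intuition is that pages replaced by an account suspension notice or an error message usually contain far fewer words than the original page.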
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within TimeMaps in Web
archives,” International Journal on Digital Libraries, 2016. https://doi.org/10.1007/s00799-016-0183-5
Y. AlNoamany and S. M. Jones, “Off-Topic Gold Standard Dataset,” GitHub. 2018.
https://github.com/oduwsdl/offtopic-goldstandard-data
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
Filtering off-topic mementos is just one step in a set of
algorithmic primitives for selecting exemplars
65
We can filter the collection to get a
good set of exemplars and then
randomly sample from the remainder.
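The filter and sample primitives can be sketched in a few lines. Here each memento is assumed to be a dictionary with a urim and an off_topic flag; that shape, the function names, and the archive.example URI-Ms are hypothetical and only serve the example.

```python
import random


def filter_mementos(mementos, keep):
    """Filter primitive: keep only mementos for which keep(memento) is true."""
    return [memento for memento in mementos if keep(memento)]


def random_sample(mementos, k, seed=None):
    """Sample primitive: choose up to k mementos uniformly at random."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(mementos, min(k, len(mementos)))


# Hypothetical collection with two off-topic mementos.
collection = [
    {"urim": "https://archive.example/1/http://example.com/a", "off_topic": False},
    {"urim": "https://archive.example/2/http://example.com/a", "off_topic": True},
    {"urim": "https://archive.example/3/http://example.com/b", "off_topic": False},
    {"urim": "https://archive.example/4/http://example.com/c", "off_topic": True},
    {"urim": "https://archive.example/5/http://example.com/d", "off_topic": False},
]

on_topic = filter_mementos(collection, lambda m: not m["off_topic"])
exemplars = random_sample(on_topic, k=2, seed=42)
```
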
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
Ordering allows us to create meaning from a
list of mementos
66
We can order the collection by some
feature and then systematically sample
every jth memento from the remainder.
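Ordering followed by systematic sampling might look like the sketch below. The mementos are assumed to be dictionaries carrying a memento_datetime string, and "every jth" is interpreted as the jth, 2jth, 3jth item and so on; both choices, along with the function names, are assumptions of this example.

```python
def order_by(mementos, key):
    """Order primitive: sort mementos by the given feature."""
    return sorted(mementos, key=key)


def systematic_sample(ordered, j):
    """Sample primitive: take every jth item from an ordered list."""
    if j < 1:
        raise ValueError("j must be at least 1")
    return ordered[j - 1::j]


# Hypothetical mementos, deliberately out of order.
mementos = [
    {"urim": "m3", "memento_datetime": "20160816"},
    {"urim": "m1", "memento_datetime": "20160814"},
    {"urim": "m6", "memento_datetime": "20160819"},
    {"urim": "m2", "memento_datetime": "20160815"},
    {"urim": "m5", "memento_datetime": "20160818"},
    {"urim": "m4", "memento_datetime": "20160817"},
]

ordered = order_by(mementos, key=lambda m: m["memento_datetime"])
sampled = [m["urim"] for m in systematic_sample(ordered, j=2)]
```

Because the list is ordered by capture time first, the sample is spread evenly across the collection's timeline instead of clustering around busy periods.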
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
[Diagram: Web archive collection → filter (include only mementos containing a given pattern; reduces the number of pages to consider) → score (score results with BM25; scores results) → order (order by descending score; orders results) → exemplars]
Scoring gives us an idea of how well a memento meets
the information needs represented by a function
67
We can combine filter, score, and order
to create a simple search engine.
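A minimal sketch of that combination, assuming a plain Okapi BM25 formula with common default parameters (k1 = 1.5, b = 0.75) and a naive tokenizer. The memento records and function names are hypothetical, and this is not the scoring code used in the experiments.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count documents containing each term
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            # Non-negative IDF variant (adds 1 inside the logarithm).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def search(query, mementos, pattern):
    """Filter by pattern, score with BM25, then order by descending score."""
    candidates = [m for m in mementos if re.search(pattern, m["text"], re.IGNORECASE)]
    if not candidates:
        return []
    scores = bm25_scores(query, [m["text"] for m in candidates])
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)

# Hypothetical memento texts.
mementos = [
    {"urim": "m1", "text": "Coronavirus cases rise as the coronavirus spreads"},
    {"urim": "m2", "text": "Local sports team wins the championship"},
    {"urim": "m3", "text": "Officials discuss the city budget and mention coronavirus "
                           "once in a long meeting about many other topics"},
]
results = search("coronavirus", mementos, r"coronavirus")
ranked_urims = [m["urim"] for m, _ in results]
```

The filter removes the sports page entirely, and BM25 ranks the short, focused page above the long page that mentions the term once.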
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
Clustering based on a feature allows us to
imbue subsets of mementos with meaning
68
With these primitives, we can
reproduce AlNoamany’s Algorithm
which we will now call DSA1.
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
These primitives allow us to create other algorithms for
selecting exemplars that tell the story the user desires
69
DSA2 focuses on representing collection
growth curves and scoring mementos
by their surrogate metadata.
DSA3 focuses on mementos that best
match the collection topic.
DSA4 focuses on finding the most novel
mementos in the collection.
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
Search engines are the de-facto method of exploring
collections; if we consider them a baseline, then how
retrievable are the exemplars produced by DSA algorithms?
70
Selecting Exemplars and Generating Story Metadata
We loaded 8
different Archive-It
collections into
different instances
of the SolrWayback
web archive
search engine.
We also executed
4 different DSA
algorithms to
produce exemplars
from these
collections.
[Diagram: each web archive collection is reduced to a set of exemplars]
@shawnmjones @StormyArchives
We then generated queries with four different methods based on
the content of the exemplars produced by each DSA algorithm
71
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
We visualized the percentage of exemplars
that were never retrieved by any query
72
Selecting Exemplars and Generating Story Metadata
x-axis: the number of search results to
review before we find the exemplar
y-axis: the percentage of exemplars that
have zero retrievability
In this graph, we are reporting
zero retrievability with:
• queries from doc2query-T5
• for exemplars chosen by DSA3
At 10 search results,
57.82% of the exemplars
were not retrieved.
After 1000 search results,
36.05% of the exemplars
were not retrieved.
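The percentages on this graph can be computed along the following lines. The sketch assumes an exemplar counts as retrieved at a cutoff when at least one query returns it within that many results. The data and names are hypothetical and do not reproduce the reported figures.

```python
def zero_retrievability_pct(exemplars, rankings, cutoff):
    """Percentage of exemplars that no query returns within the top `cutoff` results.

    rankings maps each query string to its ranked list of result identifiers.
    """
    retrieved = set()
    for ranked in rankings.values():
        retrieved.update(ranked[:cutoff])
    missed = [e for e in exemplars if e not in retrieved]
    return 100.0 * len(missed) / len(exemplars)

# Hypothetical exemplars and search results for two queries.
exemplars = ["a", "b", "c", "d"]
rankings = {
    "q1": ["a", "x", "b"],
    "q2": ["y", "z", "c"],
}

at_1 = zero_retrievability_pct(exemplars, rankings, cutoff=1)
at_3 = zero_retrievability_pct(exemplars, rankings, cutoff=3)
```

As on the slide, reviewing more results lowers the share of exemplars that are never found, but some exemplars may remain unreachable at any cutoff.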
@shawnmjones @StormyArchives
For all query methods the DSA algorithms’ exemplars
have similar retrievability
73
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
For all query methods the DSA algorithms’ exemplars
have similar retrievability
74
Selecting Exemplars and Generating Story Metadata
If all pages are relevant, then DSA algorithms produce mementos with more novelty than standard query
methods can with a state-of-the-art web archive search engine.
DSA4 was designed to surface more novel mementos and meets its goal in these results.
@shawnmjones @StormyArchives
RQ2: Which approaches work best for selecting
exemplars from web archive collections?
75
We established that four different
algorithms produced from these
primitives will select exemplars
that were not retrievable using
standard query methods and a
state-of-the-art web archive
search engine.
Removing off-topic mementos is but one step
toward selecting exemplars.
We devised a set of primitives for creating
many different types of sampling algorithms
that consider structural features.
An important step in
selecting exemplars to
summarize the collection is
identifying off-topic
mementos. We found that
word count differences
work best.
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
We implemented these primitives as part of Hypercane
76
Hypercane was used to
conduct the experiments in
this section.
Selecting Exemplars and Generating Story Metadata
S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson. 2021. Hypercane: Intelligent Sampling for Web
Archive Collections. In ACM/IEEE JCDL 2021. [to be published in September 2021]
@shawnmjones @WebSciDL
Outline
1. Motivation And Research
Questions
2. Background And Related Work
3. Selecting Exemplars And
Generating Story Metadata
4. Generating Document Metadata
5. Visualizing And Distributing Stories
6. Contributions And Conclusion
77
@shawnmjones @StormyArchives
We evaluated 55 platforms in 2017 and found that existing social
platforms do not reliably produce surrogates for mementos
78
Generating Document Metadata
If we cannot rely upon the service to generate a surrogate, we will have to create
our own.
Which surrogate works best for understanding web archive collections?
S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “MementoEmbed and Raintale
for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020.
https://arxiv.org/abs/2008.00137
@shawnmjones @StormyArchives
We reused exemplars that archivists had selected to
describe their own collections to create stories with
different surrogates...
79
Generating Document Metadata
@shawnmjones @StormyArchives
Archive-It-like surrogates visualize these
mementos as they are on Archive-It
80
Archive-It-like surrogate
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
Generating Document Metadata
This is a screenshot of a subset of
the exemplars selected by the
archivists of the Archive-It
collection
Egypt Politics and Revolution.
@shawnmjones @StormyArchives
Browser thumbnails are screenshots of the page in a
browser
81
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
Generating Document Metadata
Browser thumbnails
Browser thumbnails
are a popular
surrogate type used
at web archives.
This is a screenshot of the exemplars
selected by the archivists of the
Archive-It collection
Egypt Politics and Revolution.
@shawnmjones @StormyArchives
Social cards come from social media
platforms
82
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
Generating Document Metadata
Social cards
Social cards are a type
of surrogate typically
found on social media
platforms like
Facebook or Twitter.
These social cards were
specially designed to
include information
from web archives.
This is a screenshot of a subset of
the exemplars selected by the
archivists of the Archive-It
collection
Egypt Politics and Revolution.
@shawnmjones @StormyArchives
sc/t combines social cards and thumbnails
83
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
Generating Document Metadata
sc/t
We replaced the
striking image of the
social card with a
browser thumbnail.
This is a screenshot of a subset of
the exemplars selected by the
archivists of the Archive-It
collection
Egypt Politics and Revolution.
@shawnmjones @StormyArchives
sc+t places the social card to the left and a thumbnail
to the right
84
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
Generating Document Metadata
sc+t
Our thought was that
more information was
better.
This is a screenshot of a subset of
the exemplars selected by the
archivists of the Archive-It
collection
Egypt Politics and Revolution.
@shawnmjones @StormyArchives
sc^t is interactive
85
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
Generating Document Metadata
sc^t
When a user hovers
over the striking image
the browser thumbnail
appears.
This provides both types
of surrogates in a smaller
space.
This is a screenshot of a subset of
the exemplars selected by the
archivists of the Archive-It
collection
Egypt Politics and Revolution.
@shawnmjones @StormyArchives
We then presented these stories to Mechanical Turk (MT)
participants
86
Surrogate types shown:
• Archive-It-like
• Social Card
• Browser thumbnails
• Social Card With Thumbnail as Image (sc/t)
• Social Card With Thumbnail to Right (sc+t)
• Social Card With Thumbnail on Hover (sc^t)
• 4 stories of 15-17 URI-Ms selected by human
Archive-It curators from their collections
• 6 different surrogate types
• 24 different story-surrogate combinations
• 120 MT participants
• Given 30 seconds to view each story
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
Generating Document Metadata
@shawnmjones @StormyArchives
And then asked them which of the following come from
the same collection…
87
• Each participant was shown a list of 6 surrogates of the same type as the story they just viewed.
• They were asked to choose the 2 that they thought came from the same collection.
• They were given as much time as they wished to answer the question.
• This process is like the Sentence Verification Task used in reading comprehension studies.
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
Generating Document Metadata
@shawnmjones @StormyArchives
Social cards probably outperform the Archive-It
surrogate for participants’ correct answers
88
[Bar chart: Correct Answers Per Surrogate, showing median and mean on a scale of 0 to 2.5 for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t; annotated with p = 0.0569 and p = 0.0770]
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
Generating Document Metadata
@shawnmjones @StormyArchives
Social cards produced less interaction while participants
viewed their stories
89
We measured clicks and hovers by participants while they were viewing their stories.
For browser thumbnails alone, most of the participants clicked the link to view the actual
memento behind the surrogate.
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
Generating Document Metadata
@shawnmjones @StormyArchives
RQ3: What surrogates work best for
understanding groups of mementos?
90
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
Correct answers per surrogate indicate that social cards
probably outperform the Archive-It surrogate
• 4 stories of 15-17 mementos selected by human curators from their own collections
• 6 different surrogate types
• 24 different story-surrogate combinations
• Each given 30 seconds to view a story, then asked a question
From a user study with
120 Mechanical Turk
participants:
With social cards, users were able to
correctly answer our questions without as
much interaction.
Generating Document Metadata
@shawnmjones @StormyArchives
Social cards are generated based on the
HTML metadata that authors provide
Title: og:title -or- twitter:title -or- <title>
Description: og:description -or- twitter:description -or- description
Image: og:image -or- twitter:image
Without twitter:card and og:title or twitter:title, Twitter gives up and does not generate a card.
Facebook parses the <title> and produces a card with just a title.
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. 2021. Automatically Selecting Striking Images
for Social Cards. In ACM WebSci ‘21. https://arxiv.org/pdf/2103.04899.
91
Generating Document Metadata
What do we do if this
metadata does not exist?
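The field fallbacks listed on this slide can be sketched with Python's standard `html.parser`. The class and function names are hypothetical, and this is an illustration of the fallback order only, not how MementoEmbed or any social platform parses pages.

```python
from html.parser import HTMLParser

class CardMetadataParser(HTMLParser):
    """Collect <meta> property/name values and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content:
                # Keep the first value seen for each field.
                self.meta.setdefault(key.lower(), content)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    @property
    def title(self):
        text = "".join(self._title_parts).strip()
        return text or None

def first_present(meta, keys):
    """Return the value of the first key that has one, else None."""
    for key in keys:
        if meta.get(key):
            return meta[key]
    return None

def card_fields(html):
    """Apply the og -> twitter -> plain HTML fallback order for each card field."""
    parser = CardMetadataParser()
    parser.feed(html)
    return {
        "title": first_present(parser.meta, ["og:title", "twitter:title"]) or parser.title,
        "description": first_present(
            parser.meta, ["og:description", "twitter:description", "description"]),
        "image": first_present(parser.meta, ["og:image", "twitter:image"]),
    }

page = (
    "<html><head><title>Plain Title</title>"
    '<meta name="twitter:title" content="Tweet Title">'
    '<meta name="description" content="A page.">'
    "</head><body></body></html>"
)
fields = card_fields(page)
```

A field that comes back as `None` is one that must be generated some other way, which is the problem the following slides address.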
@shawnmjones @StormyArchives
We analyzed 277,724 news articles captured by the Internet
Archive from 1998 to 2016, and found different rates
of metadata adoption
OGP = Open Graph Protocol
Facebook Cards
150 billion
documents in the
Internet Archive
were captured
before 2010 and
thus have no card
metadata
92
Generating Document Metadata
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “Automatically Selecting Striking Images for Social
Cards,” In ACM Web Science Conference, 2021. https://doi.org/10.1145/3447535.3462505.
S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “It's All About The Cards: Sharing
on Social Media Probably Encouraged HTML Metadata Growth,” To be in ACM/IEEE Joint Conference
on Digital Libraries, 2021. [preprint: https://arxiv.org/abs/2104.04116.]
@shawnmjones @StormyArchives
By applying author behavior, we can
generate descriptions
93
Generating Document Metadata
We used the existing field values, written by page authors, as
ground truth data.
It tells us that authors tend to write card descriptions that have
the following lengths:
• 268 characters
• 52 words
• 2 sentences
We can use this length as input to automatic text summarization
algorithms.
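One way to feed these lengths into a summarizer is sketched below. The word-frequency sentence scorer is a deliberately simple stand-in, not TextRank and not the method evaluated in this work. Only the 2-sentence and 268-character targets come from the slide.

```python
import re
from collections import Counter

TARGET_SENTENCES = 2   # from the observed author behavior
TARGET_CHARS = 268     # from the observed author behavior

def split_sentences(text):
    """Naive sentence splitter on ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def summarize(text, max_sentences=TARGET_SENTENCES, max_chars=TARGET_CHARS):
    """Pick the highest-scoring sentences, keep document order, cap the length."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]), reverse=True)[:max_sentences]
    summary = " ".join(sentences[i] for i in sorted(top))
    return summary[:max_chars].rstrip()

text = ("Web archives preserve pages. Archives hold many mementos of pages. "
        "Curators build collections. The weather was nice.")
description = summarize(text)
```

Any extractive or abstractive summarizer that accepts a length budget could replace `summarize` here.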
@shawnmjones @StormyArchives
Generating Document Metadata
If no metadata
exists, we can
select a striking
image from the
images
available in the
document
Which of the images
outlined in red is the
striking one chosen by the
author?
How would a machine
know which one to choose
if there were no striking
image specified in the
metadata?
94
@shawnmjones @StormyArchives
Our generic image selection
approach has 3 steps
1. Score each image in the
document by some
approach (e.g., ML
probability, feature
value)
2. Sort the list of images by
descending score (e.g.,
highest ML probability is
first, image with most
colors is first)
3. Choose the image at the
beginning of the list
(highest scoring)
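The three steps above can be sketched generically. The image records and the `colors` field are hypothetical; the scoring function could equally return a classifier probability.

```python
def select_striking_image(images, score_fn):
    """Score each image, sort by descending score, and return the first one."""
    if not images:
        return None
    ranked = sorted(images, key=score_fn, reverse=True)
    return ranked[0]

# Hypothetical images found in one document, with a precomputed color count.
images = [
    {"uri": "logo.png", "colors": 64},
    {"uri": "photo.jpg", "colors": 91000},
    {"uri": "spacer.gif", "colors": 1},
]

# Using the number of colors as the score.
chosen = select_striking_image(images, score_fn=lambda img: img["colors"])
```

Keeping the scoring function as a parameter lets the same selection code compare simple feature values against machine learning models.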
[Figure: example images ranked two ways. Sorted by color count: 154,131; 48,020; 44,737; 30,940; and 3,816 colors. Sorted by classifier probability: 0.3623, 0.1948, 0.1259, 0.1116, and 0.11. Some images are shown resized, cropped, or larger.]
95
Generating Document Metadata
@shawnmjones @StormyArchives
We visualized how well different approaches performed at
choosing a striking image that was perceptually the same as
the author’s
96
The best
approach
starts here
As we proceed to the right, we
accept more images as
perceptually equal to the one
selected by the approach
All lines converge
as any image
becomes
acceptable as
correct
Higher scores
indicate more
accurate
answers
Remember:
we are trying to find the
approach that best selects
the striking image chosen
by the author
Generating Document Metadata
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. 2021. Automatically Selecting Striking Images
for Social Cards. In ACM WebSci ‘21. https://doi.org/10.1145/3447535.3462505.
@shawnmjones @StormyArchives
We found that Random Forest performed best with base image
features quickly calculated via standard image libraries
97
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. 2021. Automatically Selecting Striking Images
for Social Cards. In ACM WebSci ‘21. https://doi.org/10.1145/3447535.3462505.
Generating Document Metadata
P@1=0.831
MRR=0.883
base image features:
• byte size
• width in pixels
• height in pixels
• negative space
(# of histogram cols = 0)
• size in pixels
• aspect ratio
• number of colors
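P@1 and MRR, the two metrics reported on this slide, can be computed from the rank at which the author's image appears in each document's sorted list. The ranks below are made up for illustration and are unrelated to the reported values.

```python
def precision_at_1(ranks):
    """Fraction of documents where the author's image was ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank of the author's image across documents."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical 1-based ranks of the author's striking image in four documents.
ranks = [1, 1, 2, 4]

p_at_1 = precision_at_1(ranks)
mrr = mean_reciprocal_rank(ranks)
```

P@1 rewards only exact first-place hits, while MRR also gives partial credit when the author's image is near the top.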
@shawnmjones @StormyArchives
RQ4: What methods that automate the creation of surrogates
produce results that best match humans' behavior?
98
Generating Document Metadata
Authors write card descriptions that
are 268 characters, 52 words, or 2
sentences long. We can use this
length as input to automatic text
summarization algorithms, like
TextRank.
With base image features Random Forest performed best
for choosing the same striking image as the author.
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “Automatically Selecting Striking Images for Social
Cards,” In ACM Web Science Conference, 2021. https://doi.org/10.1145/3447535.3462505.
S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “It's All About The Cards: Sharing
on Social Media Probably Encouraged HTML Metadata Growth,” To be in ACM/IEEE Joint Conference
on Digital Libraries, 2021. [preprint: https://arxiv.org/abs/2104.04116.]
We analyzed the metadata
usage of news article mementos
over time.
Metadata fields associated with
cards had astronomical growth.
@shawnmjones @StormyArchives
We implemented these results as part of
MementoEmbed
99
Cards
Browser
Thumbnails
Imagereels
Word
Clouds
Generating Document Metadata
As an archive-aware surrogate service, MementoEmbed provides different types of surrogates for mementos.
It also has an extensive API for generating document metadata.
S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “MementoEmbed and Raintale
for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020.
https://arxiv.org/abs/2008.00137
@shawnmjones @WebSciDL
Outline
1. Motivation And Research
Questions
2. Background And Related Work
3. Selecting Exemplars And
Generating Story Metadata
4. Generating Document Metadata
5. Visualizing And Distributing Stories
6. Contributions And Conclusion
100
@shawnmjones @StormyArchives
Because Storify was gone, we created
Raintale for visualizing and distributing stories
101
Visualizing And Distributing Stories
S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive
Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
Storify provided an API, allowing us to configure
the look and feel of our story.
With this functionality gone, we created
Raintale, a platform agnostic storytelling tool
that generates files or social media posts.
@shawnmjones @WebSciDL
Remember, Elbert wants to promote his collections for
others, and he uses the DSA Toolkit to do so
102
Today he is
promoting a
collection about
COVID-19.
Visualizing And Distributing Stories
From this: 23,376 mementos
To this: a sample of 36 mementos
visualized as social cards, phrases,
and images
S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive
Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
@shawnmjones @StormyArchives
Elbert applies all processes of our storytelling model
103
Visualizing And Distributing Stories
[Diagram: Generate Story Metadata, Select Exemplars, Generate Document Metadata, Visualize The Story, Distribute The Story]
S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive
Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
@shawnmjones @WebSciDL
Remember, Natasha needs to compare
collections to each other
104
Today she is
reviewing
different
collections
about
shootings.
Virginia Tech, El Paso, Norway
Visualizing And Distributing Stories
S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive
Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
@shawnmjones @StormyArchives
Ling inherited a collection and needs to
know what it contains
105
Ling can apply our
processes with a
different template
to include other
information, like
structural features.
Visualizing And Distributing Stories
From this: 88,755 mementos
and no metadata
To this: 50 exemplars, structural
features, metadata analysis,
growth curves, and more
@shawnmjones @WebSciDL
Rustam wants to see how a page changed
over time
106
Visualizing And Distributing Stories
[Diagram: Generate Story Metadata, Select Exemplars, Generate Document Metadata, Visualize The Story, Distribute The Story]
Rustam uses
Hypercane to help
him choose a page
and then view its
change over time.
@shawnmjones @StormyArchives
Rustam chooses one of Raintale’s default templates
because he is using the DSA Toolkit for exploration
107
Visualizing And Distributing Stories
Rustam’s story
seems plain, but he
is really interested in
the changing text
over time.
@shawnmjones @WebSciDL
Olayinka wants to see what different news sources said
on the same day in different years
108
Visualizing And Distributing Stories
With our SHARI
process, she can
compare different
years to each other
2018
US Elections
2020
COVID-19
2019
Mass shootings in El Paso
and Dayton
S. M. Jones, A. C. Nwala, M. C. Weigle, M. Klein, and M. L. Nelson, “SHARI -- An Integration of Tools to
Visualize the Story of the Day,” In Web Archiving and Digital Libraries Workshop, 2020.
https://arxiv.org/abs/2008.00139
@shawnmjones @StormyArchives
Olayinka can look through the stories produced by our
SHARI process to perform her comparisons
109
Visualizing And Distributing Stories
Our process is not
just limited to our
implementation,
and allows us to
incorporate input
from other systems,
like StoryGraph.
[Diagram: Generate Document Metadata, Visualize The Story, Distribute The Story, Select Exemplars, Generate Story Metadata]
S. M. Jones, A. C. Nwala, M. C. Weigle, M. Klein, and M. L. Nelson, “SHARI -- An Integration of Tools to
Visualize the Story of the Day,” In Web Archiving and Digital Libraries Workshop, 2020.
https://arxiv.org/abs/2008.00139
@shawnmjones @WebSciDL
Outline
1. Motivation And Research
Questions
2. Background And Related Work
3. Selecting Exemplars And
Generating Story Metadata
4. Generating Document Metadata
5. Visualizing And Distributing Stories
6. Contributions And Conclusion
110
@shawnmjones @StormyArchives
We presented a model for storytelling with web archives
111
Contributions
@shawnmjones @WebSciDL
We established a vocabulary for different
types and structural features of collections
112
Type | % of Archive-It Collections | Description | Example Collection
Self-Archiving | 54.1% | an organization archiving itself | University of Utah Web Archive
Subject-based | 27.6% | seeds bound by single topic | Environmental Justice
Time Bounded – Expected | 14.1% | an expected event or time period | 2008 Olympics
Time Bounded – Spontaneous | 4.2% | unexpected event | Tucson Shootings
Based on a manual review of 3,382 Archive-It
collections, we classified them into 4 types.
Growth curves give us
some idea of the curatorial
involvement with a
collection over time.
Structurally, for seeds, we can study the:
• distribution of domains
• distribution of path depths
• most frequent path depth
• query string usage
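The seed features listed above can be derived from the seed URIs alone. In this sketch, path depth is assumed to mean the number of non-empty path segments, which may differ from the definition used in the paper. The seeds and function name are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def seed_structure(seeds):
    """Summarize domains, path depths, and query string usage for seed URIs."""
    domains = Counter()
    depths = Counter()
    with_query = 0
    for seed in seeds:
        parts = urlparse(seed)
        domains[parts.netloc.lower()] += 1
        # Assumption: depth = number of non-empty path segments.
        depth = len([segment for segment in parts.path.split("/") if segment])
        depths[depth] += 1
        if parts.query:
            with_query += 1
    return {
        "domains": dict(domains),
        "path_depths": dict(depths),
        "most_frequent_path_depth": depths.most_common(1)[0][0] if depths else None,
        "query_string_pct": 100.0 * with_query / len(seeds) if seeds else 0.0,
    }

# Hypothetical seeds.
seeds = [
    "http://example.org/",
    "http://example.org/news/2019/story.html",
    "http://example.com/a?id=1",
    "http://example.com/b",
]
features = seed_structure(seeds)
```

A collection dominated by depth 0 seeds points at whole sites, while deeper seeds suggest the curator chose specific pages.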
iPres 2018
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Contributions
@shawnmjones @StormyArchives
Word count is a fast, effective intra-TimeMap
method of identifying off-topic mementos
113
iPres 2018
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
Technical problems
Page gone
Hacking
Moving on from topic
Contributions
@shawnmjones @WebSciDL
We devised a set of primitives for intelligently selecting
exemplars from web archive collections
114
Contributions
@shawnmjones @StormyArchives
Hypercane implements our primitives for
selecting exemplars
115
ACM/IEEE
JCDL 2021
ACM SIGWEB
Newsletter 2021
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. Hypercane: Intelligent Sampling for Web Archive
Collections. In ACM/IEEE Joint Conference on Digital Libraries, 2021. [To be published in 2021]
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. Hypercane: Toolkit for Summarizing Large
Collections of Archived Pages. In SIGWEB Newsletter Autumn, 2021. [To be published in 2021]
Contributions
@shawnmjones @WebSciDL
We created four different algorithms from these
primitives and found that they produce exemplars with
low retrievability with a state-of-the-art search engine
116
We applied four different
query methods to the
mementos surfaced by
these algorithms.
As designed, our DSA4
algorithm surfaced more
novel exemplars than
those discoverable via
the search engine.
We measured mean
retrievability and zero
retrievability to determine
how easy a document
was to retrieve with the
given query method.
Contributions
@shawnmjones @StormyArchives
Our user study provides engineers support for
choosing social cards over other surrogate types
117
From our user study, correct answers per
surrogate indicate that social cards
probably outperform the Archive-It
surrogate
With social cards, users were able to
correctly answer our questions without
as much interaction.
ACM CIKM 2019
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM International Conference on Information and Knowledge
Management, 2019. https://doi.org/10.1145/3357384.3358039.
Contributions
@shawnmjones @WebSciDL
We established methods for generating the metadata
for social cards if it does not exist
118
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “Automatically Selecting Striking Images for Social
Cards,” In ACM Web Science Conference, 2021. https://doi.org/10.1145/3447535.3462505.
ACM Web Science
2021
For choosing striking
images, we trained
classifiers using base image
features (e.g., pixel size,
color count) to choose the
same striking image that
web page authors chose.
Random Forest with these
base image features
performed best.
Contributions
@shawnmjones @StormyArchives
We explored the reasons for metadata adoption
119
S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “It's All About The Cards: Sharing
on Social Media Probably Encouraged HTML Metadata Growth,” To be in ACM/IEEE Joint Conference
on Digital Libraries, 2021. [preprint: https://arxiv.org/abs/2104.04116.]
ACM/IEEE
JCDL 2021
Many efforts have been
made to encourage
metadata adoption by
web page authors.
Once social card
metadata became
available, its use
skyrocketed!
Contributions
@shawnmjones @WebSciDL
We released MementoEmbed and Raintale as reference
implementations for visualizing and distributing stories
120
WADL 2020 WADL 2020
We detailed how to generate
document metadata with
MementoEmbed and visualize and
distribute the story with Raintale.
We also provided an example of
these processes for a day’s news.
Contributions
S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive
Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
S. M. Jones, A. C. Nwala, M. C. Weigle, M. Klein, and M. L. Nelson, “SHARI -- An Integration of Tools to
Visualize the Story of the Day,” In Web Archiving and Digital Libraries Workshop, 2020.
https://arxiv.org/abs/2008.00139
@shawnmjones @WebSciDL
And I am eager to apply this
expertise at
Los Alamos National Laboratory’s
Information Sciences Division
(CCS-3)
121
https://oduwsdl.github.io/dsa-puddles/shawnmjones/
@shawnmjones @StormyArchives
Using our model and the lessons from these research
questions, we have implemented tools to tell stories that
summarize web archive collections
122
[Diagram: Generate Story Metadata, Select Exemplars, Generate Document Metadata, Visualize The Story, Distribute The Story]
Read the dissertation for
• use cases
• more example stories
• details on experiments
• details on these tools
• examples with web
archives other than
Archive-It
A sample of future work ideas:
• better summary evaluation
• augmenting collections with live web metadata
• entity/topic cards rather than social cards
• summarizing scholar output, project status, scatter/gather
interfaces
• solving corporate intranet search problems
Contributions:
• 5-process model for automatic
storytelling
• vocabulary for types of web archive
collections
• structural features of web archive
collections
• word count works best for identifying off-
topic mementos
• set of primitives for building algorithms
• algorithms built with primitives select
novel exemplars that standard search
engine did not discover
• social cards provide better
understanding than the existing state of
the art web archive surrogates
• machine learning can select the same
striking images as a page author
• Hypercane, MementoEmbed, and
Raintale as implementations
Conclusion
https://oduwsdl.github.io/dsa/
@shawnmjones @StormyArchives
123
What story will you tell with web archives?
@shawnmjones @WebSciDL
Backup Slides
124
@shawnmjones @StormyArchives
As collection users, we view Archive-It collections
from outside…
125
• Curators select seeds, which are captured as seed mementos
• Deep mementos are created from other pages linked to seeds
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Selecting Exemplars and Generating Story Metadata
@shawnmjones @StormyArchives
Response times per surrogate had interesting
means, but p-values were not statistically
significant at p < 0.05
126
[Bar chart: Response Times Per Surrogate, showing median and mean on a scale of 0 to 160 for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t; annotated with p = 0.190 and p = 0.202]
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
@shawnmjones @StormyArchives
The Off-Topic Memento Toolkit (OTMT) compares a seed’s first
memento with the seed’s other mementos via different
measures…
Measure | Fully Equivalent Score | Fully Dissimilar Score | Preprocessing Performed | OTMT -tm keyword
Byte Count | 0.0 | -1.0 | No | bytecount
Word Count | 0.0 | -1.0 | Yes | wordcount
Jaccard Distance | 0.0 | 1.0 | Yes | jaccard
Sørensen-Dice | 0.0 | 1.0 | Yes | sorensen
Simhash of Term Frequencies | 0 | 64 | Yes | simhash-tf
Simhash of raw memento | 0 | 64 | No | simhash-raw
Cosine Similarity of TF-IDF Vectors | 1.0 | 0 | Yes | cosine
Cosine Similarity of LSI Vectors | 1.0 | 0 | Yes | gensim_lsi
127
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
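Two of these measures can be sketched as below. The word count score is written as a relative change against the first memento, which is one formulation consistent with the score range on this slide (0.0 when equivalent, -1.0 when all words are gone). It is an assumption, not the OTMT source code, and the tokenizer is deliberately naive.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def word_count_score(first_text, later_text):
    """Relative change in word count compared with the first memento."""
    first = len(tokenize(first_text))
    later = len(tokenize(later_text))
    if first == 0:
        return 0.0
    return (later - first) / first

def jaccard_distance(first_text, later_text):
    """1 minus the overlap of the two mementos' vocabularies."""
    a, b = set(tokenize(first_text)), set(tokenize(later_text))
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical text of a seed's first memento and of a later memento.
first_memento = "the quick brown fox jumps"
later_memento = "page not found"

wc = word_count_score(first_memento, later_memento)
jd = jaccard_distance(first_memento, later_memento)
```

A memento would then be flagged as off-topic when its score crosses a chosen threshold for the selected measure.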
@shawnmjones @StormyArchives
Does most of the collection exist earlier or later in its
life?
128
This collection was created in
March 2010.
Most of its mementos come from
2016 – 2018.
Most of this collection exists later
in its life.
Structural feature discussed here:
• area under the seed memento growth curve
• lifespan of the collection
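A sketch of the area under a growth curve follows, assuming both axes are normalized: the x-axis is the fraction of the collection's lifespan elapsed and the y-axis is the fraction of seed mementos captured so far. Under that assumption the area of the step curve equals one minus the mean normalized capture time. The dates and function name are hypothetical.

```python
from datetime import datetime

def growth_curve_auc(memento_datetimes, start, end):
    """Area under the normalized cumulative growth curve, between 0 and 1."""
    lifespan = (end - start).total_seconds()
    if lifespan <= 0 or not memento_datetimes:
        raise ValueError("need a positive lifespan and at least one memento")
    area = 0.0
    for dt in memento_datetimes:
        x = (dt - start).total_seconds() / lifespan
        x = min(max(x, 0.0), 1.0)  # clamp captures outside the lifespan
        # Each memento raises the curve by 1/n from time x until the end.
        area += (1.0 - x) / len(memento_datetimes)
    return area

# Hypothetical collection created in March 2010 with mostly late captures.
start = datetime(2010, 3, 1)
end = datetime(2018, 12, 31)
captures = [
    datetime(2010, 3, 1),
    datetime(2016, 6, 1),
    datetime(2017, 6, 1),
    datetime(2018, 6, 1),
]
auc = growth_curve_auc(captures, start, end)
```

Under this normalization, an area below 0.5 suggests that most of the collection exists later in its life, and an area above 0.5 suggests the opposite.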
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
@shawnmjones @StormyArchives
When did the curator select and archive a collection’s
contents?
129
This collection was created in
March 2006.
Some of the seeds were
selected in 2006.
Many of the seeds were
selected throughout its life.
It has mementos as recent as
July 2018.
Structural features discussed here:
• area under the seed growth curve
• lifespan of the collection
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
@shawnmjones @StormyArchives
Did the curator create a collection intended to archive new versions of
the same web pages repeatedly?
130
This collection was
created in June 2014.
The seeds were selected
toward the beginning of
its life.
Mementos were
captured throughout its life.
Structural features discussed here:
• area under the seed growth curve
• area under the seed memento growth curve
• lifespan of the collection
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
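The growth-curve features on the last three slides can be sketched as a small calculation. This is a sketch under assumptions, not the paper's implementation: it assumes time is normalized over the collection's lifespan (first event to last event) and that the curve is the cumulative fraction of seeds, or seed mementos, present at each point. The function name `growth_curve_area` is hypothetical.

```python
from datetime import datetime


def growth_curve_area(datetimes):
    """Area under the normalized cumulative growth curve, between 0.0 and 1.0.

    Values above 0.5 mean most items arrived early in the lifespan;
    values below 0.5 mean most arrived late.
    """
    if not datetimes:
        raise ValueError("need at least one datetime")
    start, end = min(datetimes), max(datetimes)
    lifespan = (end - start).total_seconds()
    if lifespan == 0:
        return 1.0  # everything arrived at the same moment
    # Position of each event within the lifespan, scaled to [0, 1].
    positions = sorted((d - start).total_seconds() / lifespan for d in datetimes)
    total = len(positions)
    area = 0.0
    # Between consecutive events the curve is flat at count / total.
    for count, (left, right) in enumerate(zip(positions, positions[1:]), start=1):
        area += (count / total) * (right - left)
    return area


if __name__ == "__main__":
    # Hypothetical collection whose mementos cluster late in its life.
    late = [datetime(2010, 3, 1), datetime(2016, 3, 1),
            datetime(2017, 3, 1), datetime(2018, 3, 1)]
    # Hypothetical collection whose seeds were chosen near the beginning.
    early = [datetime(2014, 6, 1), datetime(2014, 6, 2),
             datetime(2014, 6, 3), datetime(2018, 6, 1)]
    print(f"late:  {growth_curve_area(late):.2f}")
    print(f"early: {growth_curve_area(early):.2f}")
```

Feeding in the datetimes at which seeds were added gives the seed growth curve; feeding in memento-datetimes gives the seed memento growth curve.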
@shawnmjones @StormyArchives
The Memento Protocol provides us a standard
method for acquiring information from web archives
131
Background and Related Work
Memento gives us TimeGates – identified
by URI-G – for finding a specific memento
based on its original resource and
capture datetime, its memento-datetime.
Memento also gives us TimeMaps – identified by
URI-T – for listing all of the mementos for an original
resource and their memento-datetimes.
<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>
  ; rel="self";type="application/link-format"
  ; from="Tue, 20 Jun 2000 18:02:59 GMT"
  ; until="Wed, 21 Jun 2000 04:41:56 GMT",
<http://arxiv.example.net/timegate/http://a.example.org>
  ; rel="timegate",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>
  ; rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20091027204954/http://a.example.org>
  ; rel="last memento";datetime="Tue, 27 Oct 2009 20:49:54 GMT",
<http://arxiv.example.net/web/20000621011731/http://a.example.org>
  ; rel="memento";datetime="Wed, 21 Jun 2000 01:17:31 GMT",
<http://arxiv.example.net/web/20000621044156/http://a.example.org>
  ; rel="memento";datetime="Wed, 21 Jun 2000 04:41:56 GMT"
...
Callouts in the example: URI-R (original resource), URI-T (TimeMap), URI-G (TimeGate), URI-M (memento), and memento-datetime.
Van de Sompel, H., Nelson, M., & Sanderson, R. “RFC 7089 – HTTP Framework for Time-Based
Access to Resource States -- Memento”. http://www.rfc-editor.org/info/rfc7089. 2013.
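A TimeMap in link format is straightforward to read programmatically. The sketch below parses the slide's example with regular expressions and lists each URI-M with its memento-datetime. The function names and the regex-based approach are assumptions for illustration; a production client would more likely use a dedicated link-format parser.

```python
import re
from email.utils import parsedate_to_datetime

# One link entry: <URI> followed by zero or more ;name="value" parameters.
LINK_PATTERN = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_PATTERN = re.compile(r'([\w-]+)="([^"]*)"')

EXAMPLE_TIMEMAP = """
<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>
  ; rel="self";type="application/link-format"
  ; from="Tue, 20 Jun 2000 18:02:59 GMT"
  ; until="Wed, 21 Jun 2000 04:41:56 GMT",
<http://arxiv.example.net/timegate/http://a.example.org>
  ; rel="timegate",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>
  ; rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20091027204954/http://a.example.org>
  ; rel="last memento";datetime="Tue, 27 Oct 2009 20:49:54 GMT",
<http://arxiv.example.net/web/20000621011731/http://a.example.org>
  ; rel="memento";datetime="Wed, 21 Jun 2000 01:17:31 GMT",
<http://arxiv.example.net/web/20000621044156/http://a.example.org>
  ; rel="memento";datetime="Wed, 21 Jun 2000 04:41:56 GMT"
"""


def parse_timemap(text):
    """Return a list of (uri, attributes) pairs, one per link entry."""
    entries = []
    for uri, params in LINK_PATTERN.findall(text):
        entries.append((uri, dict(PARAM_PATTERN.findall(params))))
    return entries


def list_mementos(text):
    """Return (memento-datetime, URI-M) pairs sorted by datetime."""
    mementos = []
    for uri, attrs in parse_timemap(text):
        relations = attrs.get("rel", "").split()
        if "memento" in relations and "datetime" in attrs:
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(mementos)


if __name__ == "__main__":
    for memento_datetime, urim in list_mementos(EXAMPLE_TIMEMAP):
        print(memento_datetime.isoformat(), urim)
```

The `rel` value is split on whitespace because one entry can carry several relations, as in "first memento" and "last memento" above.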
@shawnmjones @StormyArchives
We use surrogates all of the time!
132
Browser Thumbnail (example from UK Web Archive)
Text snippet (example from Bing)
Social Card (example from Facebook)
Text + Thumbnail (example from Internet Archive)
S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.”
https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018.
Motivation and Research Questions
@shawnmjones @WebSciDL
Surrogates are not new!
Traditional surrogates contain metadata
generated by humans to convey aboutness
133
An individual surrogate
summarizes an item.
Card catalogs, however, were not stories, just manual methods for
finding individual items in collections.
Motivation and Research Questions
@shawnmjones @StormyArchives
Surrogates provide a visual summary of the
content behind a URI…
134
Long URI:
https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582
The same URI as a browser
thumbnail surrogate:
The same URI as a social card
surrogate:
Background and Related Work
@shawnmjones @WebSciDL
Social media storytelling uses surrogates to
provide a “summary of summaries”
135
2 resources are shown from this Wakelet story
6 resources are shown from this Storify story
Each surrogate summarizes
a web resource.
Each story groups the
surrogates, summarizing the
topic.
We want to use this
technique to summarize
web archive collections
because users are already
familiar with this visualization
paradigm.
@shawnmjones @StormyArchives
The Problem: Understanding
web archive collections is
costly
136
• There are multiple collections about the
“same concept.”
• The metadata for each collection is
non-existent, or inconsistently applied.
• A seed is a web page to be crawled.
• A memento is an observation of a seed at
a specific point in time.
• Many collections have
1000s of seeds with multiple mementos.
• There are more than 14,000 collections.
• Archive-It is a popular platform, but other
web archive collection platforms exist
(e.g., Library of Congress, Conifer, Trove).
• Existing solutions do not handle the time
dimension inherent to web archive
collections.
more seeds = less metadata
@shawnmjones @StormyArchives 137
Our Solution: Social media storytelling uses groups
of surrogates to provide a “summary of summaries”
Each surrogate summarizes a web resource.
Each story groups the surrogates, summarizing the topic.
We want to use this technique to summarize web archive collections
because users are already familiar with this visualization paradigm.
We established a
five-process model
for storytelling with
web archive
collections
A surrogate
summarizes a
web page.
This surrogate
type is called a
social card.
Storytelling is the visualization. Our
contribution is the automation that
selects the exemplars and
metadata that make this story.
@shawnmjones @StormyArchives
The problem, summarized
• There are multiple collections
about the same concept.
• The metadata for each
collection is non-existent, or
inconsistently applied.
• Many collections have
1000s of seeds with multiple
mementos.
• There are more than 14,000
collections.
• Human review of these
mementos for collection
understanding is an expensive
proposition.
138
@shawnmjones @StormyArchives
Archive-It allows easy collection creation
Archive-It was created by the Internet Archive as a consistent user interface for constructing
web archive collections. Curators can supply live web resources as seeds and establish crawling
schedules of those seeds to create mementos.
139
@shawnmjones @StormyArchives
Reviewing mementos manually is costly
This collection has 132,599 seeds, many
with multiple mementos
Some collections have 1000s of
seeds
Each seed can have many
mementos
In some cases, this can require
reviewing 100,000+ documents to
understand the collection
140
@shawnmjones @StormyArchives
More Archive-It collections are added every
year
More than 14,000 collections exist as of the end of 2020
141
[Chart: # of New Archive-It Collections Per Year. X-axis: Year (2005 to 2020). Y-axis: # of Collections (0 to 2500). Series: All Collections, Only Private Collections, Only Public Collections.]
@shawnmjones @StormyArchives
Latent Semantic Analysis for document
clustering
142
LSA utilizes a term-document matrix
• rows correspond to terms and columns
correspond to documents
• elements are typically weighted via TF-IDF
• if TF-IDF, then each weight is proportional to the
number of times the term appears in the
document, with rare terms weighted higher
• use singular value decomposition to factor
the matrix into new matrices
• the last of these matrices contains a set of
documents with coordinates for each cluster
LSA requires that the user supply the desired
number of topics. Dark cells indicate high weights.
High weights signify clustering.
Wikipedia contributors. (2019, July 26). Latent semantic analysis. In Wikipedia, The Free Encyclopedia.
Retrieved 21:31, July 31, 2019,
from https://en.wikipedia.org/w/index.php?title=Latent_semantic_analysis&oldid=907976703
It will be difficult to generalize this number across types of collections.
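As a small illustration of the input to LSA, the sketch below builds a TF-IDF weighted term-document matrix in plain Python, with rows as terms and columns as documents. It stops before the singular value decomposition step. The function name, the whitespace tokenization, and the particular TF-IDF variant (raw count times the log of inverse document frequency) are assumptions, not the dissertation's configuration.

```python
import math
from collections import Counter


def tfidf_term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] weights term i in document j."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    counts = [Counter(doc) for doc in tokenized]
    matrix = []
    for term in vocabulary:
        idf = math.log(n_docs / doc_freq[term])
        matrix.append([counts[j][term] * idf for j in range(n_docs)])
    return vocabulary, matrix


if __name__ == "__main__":
    docs = ["storm flood storm", "storm election", "election vote"]
    vocabulary, matrix = tfidf_term_document_matrix(docs)
    for term, row in zip(vocabulary, matrix):
        print(f"{term:10s}", [round(weight, 3) for weight in row])
```

A linear algebra library would then factor this matrix and keep only the requested number of topics, which is the number the callout above says is hard to generalize.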
@shawnmjones @StormyArchives
Latent Dirichlet Allocation
For a corpus $D$ consisting of $M$ documents, each of length $N_i$:
1. Choose $\theta_i \sim \mathrm{Dir}(\alpha)$, where $i \in \{1, \dots, M\}$ and $\mathrm{Dir}(\alpha)$ is a Dirichlet distribution with symmetric parameter $\alpha$, which typically is sparse ($\alpha < 1$)
2. Choose $\varphi_k \sim \mathrm{Dir}(\beta)$, where $k \in \{1, \dots, K\}$ and $\beta$ typically is sparse
3. For each of the word positions $i, j$, where $i \in \{1, \dots, M\}$ and $j \in \{1, \dots, N_i\}$:
   1. Choose a topic $z_{i,j} \sim \mathrm{Multinomial}(\theta_i)$
   2. Choose a word $w_{i,j} \sim \mathrm{Multinomial}(\varphi_{z_{i,j}})$
Wikipedia contributors. (2019, July 25). Latent Dirichlet allocation. In Wikipedia, The Free Encyclopedia.
Retrieved 20:13, July 31, 2019,
from https://en.wikipedia.org/w/index.php?title=Latent_Dirichlet_allocation&oldid=907806560
143
$K$ is the number of topics requested by the user
$M$ is the number of documents in the corpus
$N_i$ is the number of words in document $i$
$\varphi_k$ is the word distribution for topic $k$
$\theta_i$ is the topic distribution for document $i$
$z_{i,j}$ is the topic for the $j$-th word in document $i$
$w_{i,j}$ is a specific word in document $i$
* Example of a multinomial: the probability of the counts of each side when rolling a k-sided die n times
It will be difficult to generalize this number across types of collections.
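The generative process on this slide can be sketched directly in code. The example below samples a tiny synthetic corpus by following the three steps. It is an illustration of the model only, not a topic-inference implementation, and the function names, default parameters, and toy vocabulary are assumptions. Dirichlet samples are drawn by normalizing independent Gamma draws, which is a standard construction.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [draw / total for draw in draws]


def generate_corpus(vocabulary, num_topics, doc_lengths, alpha=0.5, beta=0.5, seed=0):
    """Follow the generative steps: word distributions, topic mixtures, then words."""
    rng = random.Random(seed)
    # Step 2: one word distribution per topic.
    phi = [sample_dirichlet(rng, beta, len(vocabulary)) for _ in range(num_topics)]
    thetas, corpus = [], []
    for length in doc_lengths:
        # Step 1: the topic distribution for this document.
        theta = sample_dirichlet(rng, alpha, num_topics)
        words = []
        for _ in range(length):
            # Step 3.1: choose a topic for this word position.
            topic = rng.choices(range(num_topics), weights=theta)[0]
            # Step 3.2: choose a word from that topic's word distribution.
            words.append(rng.choices(vocabulary, weights=phi[topic])[0])
        thetas.append(theta)
        corpus.append(words)
    return phi, thetas, corpus


if __name__ == "__main__":
    vocab = ["storm", "flood", "vote", "election", "vaccine", "virus"]
    phi, thetas, corpus = generate_corpus(vocab, num_topics=3, doc_lengths=[8, 5, 6])
    for theta, words in zip(thetas, corpus):
        print([round(p, 2) for p in theta], " ".join(words))
```

Note that `num_topics` must be chosen before anything is sampled, which mirrors the slide's concern: the user has to supply K, and a good K for one collection type may not suit another.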
@shawnmjones @WebSciDL
Many have tackled selecting exemplar sentences or
images from a document; few have covered selecting
exemplar documents from a corpus over time.
144
Background and Related Work
We are
inspired by
these
solutions and
will apply
some of their
ideas in a
moment.
Silva et al. word graphs
Sipos et al. influential author clusters
Silva and Sampaio. 2014. Using Luhn’s Automatic Abstract Method to Create Graphs of
Words for Document Visualization. Social Networking. 65-70.
https://doi.org/10.4236/sn.2014.32008.
R. Sipos et al. 2012. Temporal corpus summarization using submodular word coverage.
In ACM CIKM 2012, 754-763. https://doi.org/10.1145/2396761.2396857.
@shawnmjones @StormyArchives
Existing tools for web archive collections require that the
user have access to WARCs.
145
ArchiveSpark
Archives Unleashed Cloud (now part of Archive-It)
Warclight
Archivists are the only ones likely to have that access. We want anyone
to be able to summarize a collection.
Background and Related Work
Holzmann et al. 2016. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. In ACM/IEEE JCDL
2016, 83-92. https://doi.org/10.1145/2910896.2910902.
Ruest et al. 2014. archivesunleashed/warclight – A Rails engine supporting the discovery of web archives.
https://github.com/archivesunleashed/warclight.
Deschamps et al. 2019. The Cost of a WARC: Analyzing Web Archives in the Cloud. In
ACM/IEEE JCDL 2019, 261-264. https://doi.org/10.1109/JCDL.2019.00043.
Stories also need URIs for
linking surrogates. WARCs
alone cannot do this.
@shawnmjones @StormyArchives
Existing work on generating story metadata relies
on archivists to manually review and annotate
each seed or memento
146
Scale is the greatest
challenge here. Web
archive collections
grow quickly, and
archivists have a
hard time keeping up
with the number of
documents to
annotate.
Background and Related Work
D. V. Pitti, “Encoded Archival Description,” D-Lib Magazine, vol. 5, no. 11, 1999.
https://doi.org/10.1045/november99-pitti.
Encoded Archival Description could
work, if there were not thousands of
documents to annotate.
@shawnmjones @StormyArchives
Other studies on surrogates did not focus on whether participants
understood the underlying collection, but on whether
participants chose the correct search result for a query
147
These studies did not compare
thumbnails to social cards
directly.
Web archives love using
thumbnails, but is there
something better for visitors?
Background and Related Work
@shawnmjones @StormyArchives
Others tried to visualize whole collections at once or
created solutions specific to a web archive
148
Conta Me Histórias
Padia et al.
R. Campos et al. 2021. Automatic generation of timelines for past-web events. The Past
Web: Exploring Web Archives, 225-242. https://doi.org/10.1007/978-3-030-63291-5_18.
K. Padia, Y. AlNoamany, and M. C. Weigle, “Visualizing digital collections at Archive-It,”
in Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries,
(Washington, DC, USA), pp. 15–18, 2012. https://doi.org/10.1145/2232817.2232821.
Background and Related Work
@shawnmjones @StormyArchives
Web surrogates provide a visual summary
of the content behind a URI…
149
Long URI:
https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582
The same URI as a browser
thumbnail surrogate:
The same URI as a social card
surrogate:
@shawnmjones @StormyArchives
Social media storytelling uses surrogates to provide a
“summary of summaries”
150
2 resources are shown in this Wakelet story
6 resources are shown in this Storify story
Each surrogate summarizes
a web resource.
Each story groups the
surrogates, summarizing the
topic.
We want to use this
technique to summarize
web archive collections
because users are already
familiar with this visualization
paradigm.
@shawnmjones @StormyArchives
DSA2 Algorithm
151
@shawnmjones @StormyArchives
DSA3 Algorithm
152
@shawnmjones @StormyArchives
DSA4 Algorithm
153

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

Improving Collection Understanding For Web Archives With Storytelling: Shining Light Into Dark and Stormy Archives - PhD Defense

  • 1. Improving Collection Understanding For Web Archives With Storytelling: Shining Light Into Dark and Stormy Archives Shawn M. Jones Los Alamos National Laboratory Research Library Prototyping Team Web Science and Digital Libraries Research Group Old Dominion University Dissertation Defense 2021/08/05 1 Thanks to:
  • 2. @shawnmjones @WebSciDL Outline 1. Motivation And Research Questions 2. Background And Related Work 3. Selecting Exemplars And Generating Story Metadata 4. Generating Document Metadata 5. Visualizing And Distributing Stories 6. Contributions And Conclusion 2
  • 3. @shawnmjones @StormyArchives November 8, 2019: During the second week of November 2019, the National Center for Medical Intelligence shared intelligence based on "monitoring of internal Chinese communications" that warned of a potential novel coronavirus pandemic coming out of Wuhan. COVID-19 was not named and was only known to a small group in the US. No news coverage existed. Source: https://en.wikipedia.org/wiki/Timeline_of_the_COVID-19_pandemic_in_2019
  • 4. December 16, 2019: The first documented COVID-19 hospital admission was on December 16, 2019. COVID-19 was still not well known and received no news coverage.
  • 5. January 13, 2020: One month later, CNN carries a coronavirus category on its front page.
  • 6. February 28, 2020: Another month goes by with more front-page articles about coronavirus.
  • 7. March 13, 2020: A month later, CNN had many front-page articles about coronavirus with a special Coronavirus heading for more articles.
  • 8. March 20, 2020: A week later, states are locking down.
  • 9. March 27, 2020: A week later, the US has the most cases of any country.
  • 10. A web archive helped me tell this story. These mementos are stored in the Internet Archive. They are full captures of the web code that existed on those dates.
  • 11. What other stories can we tell with web archives? Motivation and Research Questions
  • 12. Natasha is studying how disasters shape cultures... Today she is reviewing the South Louisiana Flood of 2016. Sources like Wikipedia now have a summary of the event after the fact. She wants to know about the news reporting as it was at the time of the event.
  • 13. Per Nwala et al., news articles about the event tend to slide down search results as we get further from the event. She knows that five years later, it is harder to find news articles from the event itself. Figure legend: green = coverage of the event; red = summaries of the event. A. C. Nwala, M. C. Weigle, and M. L. Nelson, “Scraping SERPs for Archival Seeds: It Matters When You Start,” in ACM/IEEE JCDL, 2018. https://doi.org/10.1145/3197026.3197056.
  • 14. Natasha also knows that news articles are updated with more current and correct information. She wants to know about the news reporting as it was at the time of the event. Figure labels: today; 8/14/2016, during the event.
  • 15. Natasha knows that any time we need proof that X said Y at date D, we need web archives. She knows that web archives contain not just “screenshots” but full captures of web code as mementos. To start, she must know a URL and capture datetime. Then she can view a memento. And she can review its code, if needed.
  • 16. Natasha also knows that archivists create web archive collections based on a theme.
  • 17. With these themed collections, she can discover documents that once existed and match her event or topic. Examples shown: Virginia Tech: Crisis, Tragedy, and Recovery Network capturing coverage of the 2011 Tucson Shootings; University of Utah capturing its web presence over time.
  • 18. Natasha has discovered multiple sites with themed web archive collections: Library of Congress, Archive-It (by the Internet Archive), Trove, and Conifer. Each site has different capabilities and different types of collections.
  • 19. Natasha chooses to look through the themed collections at Archive-It. As a popular subscription service of the Internet Archive, Archive-It helps archivists create themed collections. These collections consist of seeds. For each seed, there are multiple mementos. Mementos are observations of a seed at different points in time. The seed shown has 7 mementos (captured 7 times).
  • 20. There are multiple collections about the subject. Which one should she work with? This is not the only disaster she is studying. She needs to waste as little time as possible.
  • 21. Natasha is not alone: 44 Archive-It collections match the search query “human rights”. How are they different from each other? Which one is best for her needs?
  • 22. Rustam needs to study how the Boston Marathon Bombing unfolded… Reviewing different mementos of the same seed allows Rustam to understand when the public learned of different events, including when misinformation was corrected. Rather than digging through collections manually, how can Rustam discover and view this more quickly?
  • 23. Olayinka wants to understand what different news sources revealed on the same day… Today she is trying to understand the different reporting on the September 11th Attacks. How can Olayinka discover and view this more quickly?
  • 24. Elbert is an archivist who wants to promote his collections, so others are aware of them… He wants to help visitors like Natasha, Rustam, and Olayinka notice his collections and use them. How does he create enticing visualizations that people can understand with minimal effort?
  • 25. Ling is an archivist who inherited a collection from another archivist, and she needs to understand it so she can make decisions about it… Her collection has hundreds of thousands of seeds. Her predecessor did not provide much metadata with the collection. Archivists can add metadata to collections, but many Archive-It collections contain little metadata. The more metadata a reader needs to understand a collection, the less they have available.
  • 26. Ling knows she is not alone: the collections are often built automatically, making it difficult to know what they contain. Ling knows that the automation makes it expensive to add metadata to thousands of documents after they are collected. Source attributed on the slide: Web Archiving Technical Lead of the British Library.
  • 27. All these personas need a faster method of collection understanding. Natasha (visitor): quickly compare collections; needs to understand the overall collection. Rustam (visitor): follow a source over time; needs to understand an aspect (page) of a collection. Olayinka (visitor): understand a time from different sources; needs to understand an aspect (time) of a collection. Elbert (archivist): promote collections and help visitors understand them; needs to understand the overall collection. Ling (archivist): understand a collection that they inherited; needs to understand the overall collection.
  • 28. All are faced with more than 14,000 collections at Archive-It alone. More than 14,000 collections exist as of the end of 2020. [Chart: number of new Archive-It collections per year, 2005 to 2020, with series for all collections, only private collections, and only public collections.]
  • 29. The problem, summarized: There are multiple collections about the same concept. It is difficult to easily expose aspects (e.g., time, page) of collections. The metadata for each collection is non-existent or inconsistently applied. Many collections have 1000s of seeds with multiple mementos. There are more than 14,000 collections. Human review of these mementos for collection understanding is an expensive proposition.
  • 30. Our proposal: a visualization made of exemplar mementos. Our visualization is a summary that will act like an abstract. Following Pirolli and Card’s Information Foraging Theory, we want to maximize the value of the information gained from our summaries, minimize the cost of interacting with the collection, and ensure that our exemplar mementos have good information scent, meaning they contain cues that the memento will address a user’s needs. From this: 318 seeds with 2421 mementos. To something like this: a social media story of ~28 surrogates. P. Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
  • 31. Users already interact with pages like this every day. Twitter creates Moments that present surrogates linking to content about a topic of interest. Educators, librarians, and others create stories on Wakelet about different subjects. Examples shown: a story on Wakelet about the 2021 Capitol Attack; a Twitter Moment of astronaut Michael Collins.
  • 32. Social media stories apply visualizations that users already know how to understand. An individual surrogate summarizes a web resource. When we combine surrogates into a story, we summarize a topic.
  • 33. We developed a five-process storytelling model based on existing work on summarization and storytelling. The processes are: Select Exemplars, Generate Story Metadata, Generate Document Metadata, Visualize The Story, and Distribute The Story. Diagram example: exemplar mementos; story metadata (collection title: 2013 Boston Marathon Bombing; collected by: Internet Archive Global Events; collection URL; image data...; seed data...; top terms; top entities...); document metadata (title: Boston Marathon Explosions...; description: “The grace this tragedy exposed...”; striking image..). AlNoamany found that popular stories contain 28 elements, so we have a target of 28 exemplars. AlNoamany pioneered this work combining web archive collections with Storify, but Storify is now gone.
  • 34. Our five-process storytelling model maps to our research questions. RQ1: What types of web archive collections exist and what are their structural features? RQ2: What approaches work best for selecting exemplars from web archive collections? RQ3: What surrogates work best for understanding groups of mementos? RQ4: What methods that automate the creation of surrogates produce results that best match humans’ behavior? Processes shown: Generate Story Metadata, Select Exemplars, Generate Document Metadata, Visualize The Story, Distribute The Story. Also shown: Examples and Use Cases for our Personas.
  • 35. Our Dark and Stormy Archives Tools serve as a reference implementation of our storytelling process.
  • 36. Outline: 1. Motivation And Research Questions 2. Background And Related Work 3. Selecting Exemplars And Generating Story Metadata 4. Generating Document Metadata 5. Visualizing And Distributing Stories 6. Contributions And Conclusion
  • 37. URIs identify resources. Resources have different representations depending on the visitor’s needs. URIs are a superset of identifiers that contains URLs, URNs, etc. T. Berners-Lee, et al. “RFC 3986 – Uniform Resource Identifier (URI): Generic Syntax”. https://www.rfc-editor.org/rfc/rfc3986.txt, 2005. Jacobs, I. and Walsh, N. eds., “Architecture of the World Wide Web, Vol. 1.” https://www.w3.org/TR/webarch/, 2003. Background and Related Work
  • 38. HTML is the file format we use for web resources. HTML contains links to other pages, identified by URIs.
  • 39. Web archives apply crawlers to quickly visit pages and follow links to build collections. The page to be crawled is a seed or original resource. An observation of that original resource at a specific time is a memento. We use the term URI-M to denote a memento URI. The datetime of a memento’s capture is its memento-datetime. Crawlers save web resources in the WARC file format. Example WARC record shown on the slide: WARC/1.0 WARC-Date: 2016-04-01T18:08:53Z WARC-Type: response WARC-Record-ID: <urn:uuid:8ede0256-790b-4378-a469-cfdcf0b9f8a6> WARC-Target-URI: http://philadelphia.feb.gov/ WARC-Payload-Digest: sha1:USWZWZNSVCOSXA2MBSHMRAVZY7R2ZQSY WARC-Block-Digest: sha1:NNBOIVAWIP63IESE3JCE36X6BRE2YYZG Content-Type: application/http; msgtype=response Content-Length: 19556 HTTP/1.0 200 OK server: Apache-Coyote/1.1 content-type: text/html;charset=utf-8 content-length: 35917 date: Sat, 08 May 2021 16:04:56 GMT <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> ...
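The WARC record on this slide starts with a version line followed by named header fields, and a blank line separates those fields from the captured HTTP response. As a minimal sketch of that layout, assuming the record is already available as text, the hypothetical helper `parse_warc_headers` below reads only the WARC header block. The shortened sample record reuses values from the slide; real WARC files are binary and often compressed, so a maintained WARC library would be the better choice for actual archives.

```python
def parse_warc_headers(record_text: str) -> dict:
    """Return the WARC header fields that precede the first blank line.

    Hypothetical helper for illustration only; it does not handle
    compressed files, continuation lines, or multiple records.
    """
    lines = record_text.splitlines()
    headers = {"version": lines[0].strip()}  # e.g. "WARC/1.0"
    for line in lines[1:]:
        if not line.strip():  # a blank line ends the WARC header block
            break
        # Split on the first colon only, so URI values keep their colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


# Shortened copy of the record shown on the slide.
record = """WARC/1.0
WARC-Date: 2016-04-01T18:08:53Z
WARC-Type: response
WARC-Target-URI: http://philadelphia.feb.gov/
Content-Type: application/http; msgtype=response
Content-Length: 19556

HTTP/1.0 200 OK
content-type: text/html;charset=utf-8
"""

headers = parse_warc_headers(record)
print(headers["WARC-Target-URI"], headers["WARC-Date"])
```

Here `WARC-Target-URI` is the URI that was crawled (the seed or original resource) and `WARC-Date` is the capture time recorded by the crawler.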
  • 43. A TimeMap gives us a listing of the mementos available for an original resource. Example TimeMap (annotated on the slide with the original resource “now” and mementos from 1997, 1999, and 2009): <http://www.cs.odu.edu>;rel="original", <https://web.archive.org/web/19970102130137/http://cs.odu.edu:80/>;rel="memento"; datetime="Thu, 02 Jan 1997 13:01:37 GMT", <https://web.archive.org/web/19970606105039/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Fri, 06 Jun 1997 10:50:39 GMT", <http://archive.md/19970606105039/http://www.cs.odu.edu/>;rel="memento"; datetime="Fri, 06 Jun 1997 10:50:39 GMT", <https://web.archive.org/web/19971010201632/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Fri, 10 Oct 1997 20:16:32 GMT", <https://web.archive.org/web/19971211124211/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Thu, 11 Dec 1997 12:42:11 GMT", ... <https://web.archive.org/web/19990502033600/http://cs.odu.edu:80/>;rel="memento"; datetime="Sun, 02 May 1999 03:36:00 GMT", ... <https://arquivo.pt/wayback/20091223043049mp_/http://www.cs.odu.edu/>;rel="memento"; datetime="Wed, 23 Dec 2009 04:30:49 GMT", ... Van de Sompel, H., Nelson, M., & Sanderson, R. “RFC 7089 – HTTP Framework for Time-Based Access to Resource States -- Memento”. http://www.rfc-editor.org/info/rfc7089. 2013.
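The TimeMap on this slide is a list of links, each with a `rel` value and, for mementos, a `datetime`. As a rough sketch of how such a listing could be read, the hypothetical function `parse_timemap` below extracts the original resource and the mementos from a few entries copied from the slide. The regular expression is an assumption that covers only the attributes shown here; it is not a complete link-format parser.

```python
import re
from datetime import datetime, timezone

# Matches <uri>;rel="..." with an optional ; datetime="..." attribute.
LINK_PATTERN = re.compile(
    r'<([^>]+)>\s*;\s*rel="([^"]+)"(?:\s*;\s*datetime="([^"]+)")?'
)


def parse_timemap(text):
    """Return (original_uri, [(memento_datetime, uri_m), ...]) sorted by time."""
    original = None
    mementos = []
    for uri, rel, dt in LINK_PATTERN.findall(text):
        relations = rel.split()
        if "original" in relations:
            original = uri
        if "memento" in relations and dt:
            when = datetime.strptime(dt, "%a, %d %b %Y %H:%M:%S GMT")
            mementos.append((when.replace(tzinfo=timezone.utc), uri))
    return original, sorted(mementos)


# Three entries copied from the TimeMap on the slide.
timemap = (
    '<http://www.cs.odu.edu>;rel="original",\n'
    '<https://arquivo.pt/wayback/20091223043049mp_/http://www.cs.odu.edu/>'
    ';rel="memento"; datetime="Wed, 23 Dec 2009 04:30:49 GMT",\n'
    '<https://web.archive.org/web/19970102130137/http://cs.odu.edu:80/>'
    ';rel="memento"; datetime="Thu, 02 Jan 1997 13:01:37 GMT"'
)

original, mementos = parse_timemap(timemap)
for memento_datetime, uri_m in mementos:
    print(memento_datetime.isoformat(), uri_m)
```

Sorting by memento-datetime gives the chronological view of one original resource that the later slides rely on when comparing mementos of the same seed.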
  • 44. Others have tackled portions of the problem of summarizing web archives, but only AlNoamany addressed all processes. Some have conflated our steps of generating metadata and visualizing it. Many have focused, and continue to focus, on selecting exemplar words, sentences, images, video clips, and more for summarization. Those who have evaluated surrogates in the past focused on whether the participant chose the correct search engine result, not on understanding. Attempts to manually apply metadata to these collections are impacted by the scale of the problem.
  • 45. AlNoamany identified the characteristics of social media stories and Archive-It collections. By analyzing the characteristics of stories and collections, she determined that popular stories contain 28 elements. Our model maps to hers but expands her visualize step. AlNoamany’s sieve diagram gives us one solution for storytelling. We will explore others. Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From Archived Collections,” in ACM Web Science, pp. 309–318, 2017. https://doi.org/10.1145/3091478.3091508.
  • 46. AlNoamany extracted some story metadata and relied on Storify to create and distribute the resulting visualization. (AlNoamany et al., “Generating Stories From Archived Collections,” 2017.)
  • 47. Her proof-of-concept generated some document metadata and relied on Storify to generate the rest. Diagram legend: Storify; AlNoamany’s Proof-of-Concept (POC); both POC and Storify generated portions of document metadata. (AlNoamany et al., “Generating Stories From Archived Collections,” 2017.)
  • 48. She generated many different stories based on exemplars selected by her proof-of-concept. (AlNoamany et al., “Generating Stories From Archived Collections,” 2017.)
  • 49. Through a user study, she demonstrated that participants could tell the difference between her solution’s stories and randomly generated stories. Participants could not tell the difference between her solution’s stories and those generated by human archivists. (AlNoamany et al., “Generating Stories From Archived Collections,” 2017.)
  • 50. Unfortunately, her solution is difficult to generalize. Adobe shut down the Storify platform in 2018. AlNoamany’s POC focused on Archive-It.
  • 51. Outline: 1. Motivation And Research Questions 2. Background And Related Work 3. Selecting Exemplars And Generating Story Metadata 4. Generating Document Metadata 5. Visualizing And Distributing Stories 6. Contributions And Conclusion
  • 52. As collection users, what structural features can we view from outside? Using only structural features is advantageous because it saves one from having to download a collection’s content. These structural features give us different insight than can be provided by text analysis or metadata. Structural features shown here: number of seeds (81,014 seeds) and number of mementos (486,227 seed mementos). S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P Selecting Exemplars and Generating Story Metadata
  • 53. Was the collection built from web sites belonging to one domain or many? Figure panels: many domains; one domain. Structural feature discussed here: domain diversity. (Jones et al., “The Many Shapes of Archive-It,” iPRES 2018.)
  • 54. Were most of the web pages in the collection top-level pages or specific articles deeper in a web site? Figure panels: top-level pages; deeper links. Structural features discussed here: path depth diversity and most frequent path depth. (Jones et al., “The Many Shapes of Archive-It,” iPRES 2018.)
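The structural features named on the last three slides can be computed from seed URIs alone, without downloading any content. The sketch below is illustrative rather than the measure used in the cited paper: it assumes "diversity" means the ratio of unique values to the number of seeds, and that path depth is the number of non-empty path segments. The function names and the example seed list are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_depth(url: str) -> int:
    """Count non-empty path segments: '/' is depth 0, '/a/b.html' is depth 2."""
    return len([segment for segment in urlsplit(url).path.split("/") if segment])


def structural_features(seed_urls):
    """Summarize a seed list using only its URIs (no page content needed)."""
    domains = [urlsplit(url).hostname for url in seed_urls]
    depths = [path_depth(url) for url in seed_urls]
    total = len(seed_urls)
    return {
        "seed_count": total,
        # Assumed definition: unique values divided by number of seeds.
        "domain_diversity": len(set(domains)) / total,
        "path_depth_diversity": len(set(depths)) / total,
        "most_frequent_path_depth": Counter(depths).most_common(1)[0][0],
    }


# Hypothetical seed list for illustration.
seeds = [
    "http://example.org/",
    "http://example.org/news/2016/flood.html",
    "http://example.com/a/b/c",
    "http://example.net/about",
]

features = structural_features(seeds)
print(features)
```

Under these assumptions, a self-archiving collection would tend toward low domain diversity, while a collection of news articles about one event would tend toward deeper paths.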
  • 55. Growth curves provide some understanding of collection curation behavior. Skew of the collection’s holdings: indicates temporality of collection. Skew of the curatorial involvement with the collection: when seeds were added; when interest was lost or regained. Figure panels show a positive and a negative example of each skew. (Jones et al., “The Many Shapes of Archive-It,” iPRES 2018.)
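One way to picture a growth curve is as a set of points relating how far along the collection's life we are to how many of its seeds have been added. The sketch below rests on assumptions not stated on the slide: it takes the first capture date of each seed as the date the seed was added, and it measures the collection's life from the first to the last of those dates. The function name and dates are hypothetical.

```python
from datetime import datetime


def growth_curve(first_seen):
    """Return (fraction_of_lifespan, fraction_of_seeds_added) points.

    `first_seen` holds one datetime per seed: the assumed date on which
    that seed entered the collection.
    """
    ordered = sorted(first_seen)
    start, end = ordered[0], ordered[-1]
    lifespan = (end - start).total_seconds() or 1.0  # avoid dividing by zero
    total = len(ordered)
    return [
        ((when - start).total_seconds() / lifespan, (index + 1) / total)
        for index, when in enumerate(ordered)
    ]


# Hypothetical collection: three seeds added in the first days, one at the end.
first_seen = [
    datetime(2016, 1, 1),
    datetime(2016, 1, 2),
    datetime(2016, 1, 3),
    datetime(2016, 12, 31),
]

points = growth_curve(first_seen)
for life_fraction, seed_fraction in points:
    print(f"{life_fraction:.3f} of lifespan -> {seed_fraction:.2f} of seeds")
```

In this example 75% of the seeds arrive in the first 1% of the lifespan, which is the kind of early-heavy skew the slide associates with when seeds were added.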
  • 56. We discovered four semantic categories in Archive-It collections. In a study of 3,382 Archive-It collections: Self-Archiving, 54.1% of collections; Subject-based, 27.6% of collections; Time Bounded – Expected, 14.1% of collections; Time Bounded – Spontaneous, 4.2% of collections (some evaluated by AlNoamany). (Jones et al., “The Many Shapes of Archive-It,” iPRES 2018.)
  • 57. Self-Archiving collections dominate Archive-It: 54.1% of the 3,382 collections studied. These are organizations archiving themselves or those they are responsible for. (Jones et al., “The Many Shapes of Archive-It,” iPRES 2018.)
  • 58. Subject-based collections come in second: 27.6% of the 3,382 collections studied. These are collections centered on a subject that is not ephemeral. (Jones et al., “The Many Shapes of Archive-It,” iPRES 2018.)
  • 59. Time Bounded – Expected collections summarize events we anticipate: 14.1% of the 3,382 collections studied. These are collections about an anticipated event. (Jones et al., “The Many Shapes of Archive-It,” iPRES 2018.)
  • 60. Time Bounded – Spontaneous collections summarize unexpected events: 4.2% of the 3,382 collections studied. These are collections about an unexpected event. Some of these were evaluated by AlNoamany. (Jones et al., “The Many Shapes of Archive-It,” iPRES 2018.)
  • 61. We can bridge the structural to the descriptive… Using the structural features mentioned previously, we can predict these semantic categories (Self-Archiving, Subject-based, Time Bounded – Expected, Time Bounded – Spontaneous) with a Random Forest classifier with F1 = 0.720. (Jones et al., “The Many Shapes of Archive-It,” iPRES 2018.)
  • 62. RQ1: What types of web archive collections exist and what structural features do they have? Based on a manual review of 3,382 Archive-It collections, we classified them into 4 types, listed here as type; % of Archive-It collections; description; example collection. Self-Archiving; 54.1%; an organization archiving itself; University of Utah Web Archive. Subject-based; 27.6%; seeds bound by single topic; Environmental Justice. Time Bounded – Expected; 14.1%; an expected event or time period; 2008 Olympics. Time Bounded – Spontaneous; 4.2%; unexpected event; Tucson Shootings. Structurally, for seeds, we can study the distribution of domains, the distribution of path depths, the most frequent path depth, and query string usage. Growth curves give us some idea of the curatorial involvement with a collection over time. The shapes of these growth curves indicate how we might cluster in time. This example growth curve shows us that 30% of the seeds were added early in the collection’s life. When selecting exemplars, we need to summarize the collection in terms of time and topic. (Jones et al., “The Many Shapes of Archive-It,” iPRES 2018.)
  • 63. @shawnmjones @StormyArchives Identifying off-topic mementos is key to choosing exemplar mementos 63 Hacked Moved on from topic Collections have a topic. Seeds are selected to support that topic. Mementos are observations of seeds. Some of these versions are off-topic. Excluding these off-topic mementos from consideration is key to selecting exemplars. Web Page Gone Account Suspension S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87 Selecting Exemplars and Generating Story Metadata
  • 64. @shawnmjones @StormyArchives We found that Word Count had the best F1 score for identifying off-topic mementos. We reused AlNoamany’s labeled dataset. She did not try: • Sørensen-Dice • Simhash of raw content • Simhash of TF • Gensim LSI. Our word count accuracy came out ahead of AlNoamany’s. S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87 Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within TimeMaps in Web archives,” International Journal on Digital Libraries, 2016. https://doi.org/10.1007/s00799-016-0183-5 Y. AlNoamany and S. M. Jones, “Off-Topic Gold Standard Dataset,” GitHub. 2018. https://github.com/oduwsdl/offtopic-goldstandard-data
  • 65. @shawnmjones @StormyArchives Filtering off-topic mementos is just one step in a set of algorithmic primitives for selecting exemplars 65 We can filter the collection to get a good set of exemplars and then randomly sample from the remainder. Selecting Exemplars and Generating Story Metadata
  • 66. @shawnmjones @StormyArchives Ordering allows us to create meaning from a list of mementos 66 We can order the collection by some feature and then systematically sample every jth memento from the remainder. Selecting Exemplars and Generating Story Metadata
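The filter and order primitives, each followed by a sampling step, can be sketched in a few lines. This is an illustrative sketch only and not Hypercane's implementation: the function names, the dictionary layout of a memento, and the `off_topic` flag are assumptions.

```python
import random


def filter_then_random_sample(mementos, keep, k, seed=0):
    """Filter the collection, then randomly sample k mementos from the remainder."""
    remainder = [m for m in mementos if keep(m)]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(remainder, min(k, len(remainder)))


def order_then_systematic_sample(mementos, key, j):
    """Order the collection by a feature, then take every jth memento."""
    ordered = sorted(mementos, key=key)
    return ordered[j - 1::j]  # the jth, 2jth, 3jth, ... items


# Hypothetical mementos: a URI-M, a capture date, and an off-topic flag.
collection = [
    {"urim": f"https://archive.example/{i}", "datetime": f"2011-01-{i:02d}", "off_topic": i % 3 == 0}
    for i in range(1, 7)
]

on_topic_sample = filter_then_random_sample(collection, lambda m: not m["off_topic"], k=2)
every_second = order_then_systematic_sample(collection, key=lambda m: m["datetime"], j=2)
```

Keeping each primitive as a small function over a list of mementos is what lets them be chained into larger sampling algorithms.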
  • 67. @shawnmjones @StormyArchives Scoring gives us an idea of how well a memento meets the information needs represented by a function. We can combine filter, score, and order to create a simple search engine. Steps from web archive collection to exemplars, with the intention of each step: • filter include-only mementos containing a given pattern (reduces number of pages to consider) • score results with BM25 (scores results) • order by descending score (orders results).
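The filter, score, order chain on this slide can be sketched as follows. This is a hedged illustration rather than Hypercane's code: the BM25 variant (with the commonly used parameters `k1=1.5`, `b=0.75`), the tokenizer, the function names, and the example mementos are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n if n else 0.0
    df = Counter()
    for terms in tokenized:
        df.update(set(terms))
    scores = []
    for terms in tokenized:
        tf = Counter(terms)
        score = 0.0
        for q in tokenize(query):
            if q not in tf:
                continue
            idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
            norm = tf[q] + k1 * (1 - b + b * len(terms) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores


def simple_search(mementos, pattern, query):
    # filter: include only mementos containing the pattern
    kept = [m for m in mementos if re.search(pattern, m["text"], re.IGNORECASE)]
    # score: BM25 against the query
    scores = bm25_scores(query, [m["text"] for m in kept])
    # order: descending score
    ranked = sorted(zip(kept, scores), key=lambda pair: pair[1], reverse=True)
    return [(m["urim"], s) for m, s in ranked]


# Hypothetical mementos, for illustration only.
mementos = [
    {"urim": "urim-a", "text": "Protests in Cairo continue as crowds gather in Tahrir Square"},
    {"urim": "urim-b", "text": "Cairo weather report: sunny skies"},
    {"urim": "urim-c", "text": "Football results from the weekend"},
]
results = simple_search(mementos, pattern="cairo", query="protests tahrir")
```

Here the filter drops the football page, and the remaining two pages are ranked by how well they match the query.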
  • 68. @shawnmjones @StormyArchives Clustering based on a feature allows us to imbue subsets of mementos with meaning 68 With these primitives, we can reproduce AlNoamany’s Algorithm which we will now call DSA1. Selecting Exemplars and Generating Story Metadata
  • 69. @shawnmjones @StormyArchives These primitives allow us to create other algorithms for selecting exemplars that tell the story the user desires 69 DSA2 focuses on representing collection growth curves and scoring mementos by their surrogate metadata. DSA3 focuses on mementos that best match the collection topic. DSA4 focuses on finding the most novel mementos in the collection. Selecting Exemplars and Generating Story Metadata
  • 70. @shawnmjones @StormyArchives Search engines are the de facto method of exploring collections; if we consider them a baseline, then how retrievable are the exemplars produced by DSA algorithms? We loaded 8 different Archive-It collections into different instances of the SolrWayback web archive search engine. We also executed 4 different DSA algorithms to produce exemplars from these collections.
  • 71. @shawnmjones @StormyArchives We then generated queries with four different methods based on the content of the exemplars produced by each DSA algorithm 71 Selecting Exemplars and Generating Story Metadata
  • 72. @shawnmjones @StormyArchives We visualized the percentage of exemplars that were never retrieved by any query. The x-axis is the number of search results to review before we find the exemplar. The y-axis is the percentage of exemplars that have zero retrievability. In this graph, we are reporting zero retrievability with: • queries from doc2query-T5 • for exemplars chosen by DSA3. At 10 search results, 57.82% of the exemplars were not retrieved. After 1000 search results, 36.05% of the exemplars were not retrieved.
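The zero-retrievability percentage plotted on this slide can be sketched as below. This is an assumed reading of the measure, not the experiment's code: an exemplar counts as retrieved at a cutoff if it appears within the first `cutoff` results of at least one query. The function name and the toy result lists are hypothetical.

```python
def zero_retrievability(exemplars, ranked_results, cutoff):
    """Percentage of exemplars that no query returns within the top `cutoff` results."""
    retrieved = set()
    for results in ranked_results.values():
        retrieved.update(results[:cutoff])
    exemplars = set(exemplars)
    if not exemplars:
        return 0.0
    missing = exemplars - retrieved
    return 100.0 * len(missing) / len(exemplars)


# Hypothetical exemplars and ranked search results per query.
exemplars = ["e1", "e2", "e3", "e4"]
ranked_results = {
    "query 1": ["x", "e1", "e2"],
    "query 2": ["e3", "y", "z", "e4"],
}

at_1 = zero_retrievability(exemplars, ranked_results, cutoff=1)
at_2 = zero_retrievability(exemplars, ranked_results, cutoff=2)
at_10 = zero_retrievability(exemplars, ranked_results, cutoff=10)
```

As on the slide, the percentage can only stay the same or fall as the cutoff grows, because a larger cutoff reviews more results per query.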
  • 74. @shawnmjones @StormyArchives For all query methods the DSA algorithms’ exemplars have similar retrievability 74 Selecting Exemplars and Generating Story Metadata If all pages are relevant, then DSA algorithms produce mementos with more novelty than standard query methods can with a state-of-the-art web archive search engine. DSA4 was designed to surface more novel mementos and meets its goal in these results.
  • 75. @shawnmjones @StormyArchives RQ2: Which approaches work best for selecting exemplars from web archive collections? An important step in selecting exemplars to summarize the collection is identifying off-topic mementos. We found that word count differences work best. Removing off-topic mementos is but one step toward selecting exemplars. We devised a set of primitives for creating many different types of sampling algorithms that consider structural features. We established that four different algorithms produced from these primitives will select exemplars that were not retrievable using standard query methods and a state-of-the-art web archive search engine. S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
  • 76. @shawnmjones @StormyArchives We implemented these primitives as part of Hypercane 76 Hypercane was used to conduct the experiments in this section. Selecting Exemplars and Generating Story Metadata S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson. 2021. Hypercane: Intelligent Sampling for Web Archive Collections. In ACM/IEEE JCDL 2021. [to be published in September 2021]
  • 77. @shawnmjones @WebSciDL Outline 1. Motivation And Research Questions 2. Background And Related Work 3. Selecting Exemplars And Generating Story Metadata 4. Generating Document Metadata 5. Visualizing And Distributing Stories 6. Contributions And Conclusion 77
  • 78. @shawnmjones @StormyArchives We evaluated 55 platforms in 2017 and found that existing social platforms do not reliably produce surrogates for mementos. If we cannot rely upon the service to generate a surrogate, we will have to create our own. Which surrogate works best for understanding web archive collections? S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
  • 79. @shawnmjones @StormyArchives We reused exemplars that archivists had selected to describe their own collections to create stories with different surrogates... 79 Generating Document Metadata
  • 80. @shawnmjones @StormyArchives Archive-It like surrogates visualize these mementos as they are on Archive-It 80 Archive-It like surrogate S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata This is a screenshot of a subset of the exemplars selected by the archivists of the Archive-It collection Egypt Politics and Revolution.
  • 81. @shawnmjones @StormyArchives Browser thumbnails are screenshots of the page in a browser 81 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata Browser thumbnails Browser thumbnails are a popular surrogate type used at web archives. This is a screenshot of the exemplars selected by the archivists of the Archive-It collection Egypt Politics and Revolution.
  • 82. @shawnmjones @StormyArchives Social cards come from social media platforms 82 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata Social cards Social cards are a type of surrogate typically found on social media platforms like Facebook or Twitter. These social cards were specially designed to include information from web archives. This is a screenshot of a subset of the exemplars selected by the archivists of the Archive-It collection Egypt Politics and Revolution.
  • 83. @shawnmjones @StormyArchives sc/t combines social cards and thumbnails 83 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata sc/t We replaced the striking image of the social card with a browser thumbnail. This is a screenshot of a subset of the exemplars selected by the archivists of the Archive-It collection Egypt Politics and Revolution.
  • 84. @shawnmjones @StormyArchives sc+t places the social card to the left and a thumbnail to the right 84 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata sc+t Our thought was that more information was better. This is a screenshot of a subset of the exemplars selected by the archivists of the Archive-It collection Egypt Politics and Revolution.
  • 85. @shawnmjones @StormyArchives sc^t is interactive 85 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata sc^t When a user hovers over the striking image the browser thumbnail appears. This provides both types of surrogates in a smaller space. This is a screenshot of a subset of the exemplars selected by the archivists of the Archive-It collection Egypt Politics and Revolution.
  • 86. @shawnmjones @StormyArchives We then presented these stories to Mechanical Turk (MT) participants • 4 stories of 15-17 URI-Ms selected by human Archive-It curators from their collections • 6 different surrogate types: Archive-It like, Browser thumbnails, Social Card, Social Card With Thumbnail as Image (sc/t), Social Card With Thumbnail to Right (sc+t), Social Card with Thumbnail on Hover (sc^t) • 24 different story-surrogate combinations • 120 MT participants • Given 30 seconds to view each story. S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
  • 87. @shawnmjones @StormyArchives And then asked them which of the following come from the same collection… 87 • Each participant was shown a list of 6 surrogates of the same type as the story they just viewed. • They were asked to choose the 2 that they thought came from the same collection. • They were given as much time as they wished to answer the question. • This process is like the Sentence Verification Task used in reading comprehension studies. S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata
  • 88. @shawnmjones @StormyArchives Social cards probably outperform the Archive-It surrogate for participants’ correct answers. [Bar chart: Correct Answers Per Surrogate, median and mean on a scale of 0 to 2.5, for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t; annotated with p = 0.0569 and p = 0.0770.] S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
  • 89. @shawnmjones @StormyArchives Social cards produced less interaction while participants viewed their stories 89 We measured clicks and hovers by participants while they were viewing their stories. For browser thumbnails alone, most of the participants clicked the link to view the actual memento behind the surrogate. S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039. Generating Document Metadata
  • 90. @shawnmjones @StormyArchives RQ3: What surrogates work best for understanding groups of mementos? From a user study with 120 Mechanical Turk participants: • 4 stories of 15-17 mementos selected by human curators from their own collections • 6 different surrogate types • 24 different story-surrogate combinations • Each given 30 seconds to view a story, then asked a question. Correct answers per surrogate indicate that social cards probably outperform the Archive-It surrogate. With social cards, users were able to correctly answer our questions without as much interaction. S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
  • 91. @shawnmjones @StormyArchives Social cards are generated based on the HTML metadata that authors provide og:title -or- twitter:title -or- <title> og:description -or- twitter:description -or- description og:image -or- twitter:image Without twitter:card and og:title or twitter:title, Twitter gives up and does not generate a card. Facebook parses the <title> and produces a card with just a title. S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. 2021. Automatically Selecting Striking Images for Social Cards. In ACM WebSci ‘21. https://arxiv.org/pdf/2103.04899. 91 Generating Document Metadata What do we do if this metadata does not exist?
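The fallback order on this slide (og: field, then twitter: field, then the plain HTML element) can be sketched with the standard library's HTML parser. This is an assumed reading of the slide, not MementoEmbed's code: the class name, the helper names, and the example page are hypothetical, and real platforms apply additional rules.

```python
from html.parser import HTMLParser


class CardMetadataParser(HTMLParser):
    """Collect <meta> name/property values and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = attrs.get("property") or attrs.get("name")
            if name and attrs.get("content"):
                self.meta.setdefault(name.lower(), attrs["content"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def first_present(*values):
    """Return the first non-empty value, or None if all are missing."""
    for value in values:
        if value:
            return value
    return None


def card_fields(html):
    parser = CardMetadataParser()
    parser.feed(html)
    m = parser.meta
    return {
        "title": first_present(m.get("og:title"), m.get("twitter:title"), parser.title.strip()),
        "description": first_present(m.get("og:description"), m.get("twitter:description"), m.get("description")),
        "image": first_present(m.get("og:image"), m.get("twitter:image")),
    }


# Hypothetical page with only partial card metadata.
page = (
    '<html><head><title>Plain Title</title>'
    '<meta name="description" content="A plain description.">'
    '<meta property="og:image" content="https://example.org/a.png">'
    '</head><body></body></html>'
)
fields = card_fields(page)
```

Any field that comes back as `None` is one where the card metadata would have to be generated rather than read from the page.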
  • 92. @shawnmjones @StormyArchives We analyzed 277,724 news articles captured by the Internet Archive from 1998 to 2016, and found different rates of metadata adoption (OGP = Open Graph Protocol, used for Facebook cards). 150 billion documents in the Internet Archive were captured before 2010 and thus have no card metadata. S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “Automatically Selecting Striking Images for Social Cards,” In ACM Web Science Conference, 2021. https://doi.org/10.1145/3447535.3462505. S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “It's All About The Cards: Sharing on Social Media Probably Encouraged HTML Metadata Growth,” To be in ACM/IEEE Joint Conference on Digital Libraries, 2021. [preprint: https://arxiv.org/abs/2104.04116.]
  • 93. @shawnmjones @StormyArchives By applying author behavior, we can generate descriptions 93 Generating Document Metadata We used the existing field values, written by page authors, as ground truth data. It tells us that authors tend to write card descriptions that have the following lengths: • 268 characters • 52 words • 2 sentences We can use this length as input to automatic text summarization algorithms.
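The lengths above can act as budgets for any summarizer. The sketch below is deliberately simple and is not the method used in the work: it keeps leading sentences (a lead-sentence baseline standing in for a real text summarization algorithm) while all three budgets hold. The function name and the regex sentence splitter are assumptions.

```python
import re

# Budgets taken from the author-behavior lengths on the slide.
MAX_CHARS = 268
MAX_WORDS = 52
MAX_SENTENCES = 2


def lead_summary(text, max_sentences=MAX_SENTENCES, max_words=MAX_WORDS, max_chars=MAX_CHARS):
    """Keep leading sentences while the result stays within all three budgets."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chosen = []
    for sentence in sentences[:max_sentences]:
        candidate = " ".join(chosen + [sentence])
        if len(candidate) > max_chars or len(candidate.split()) > max_words:
            break
        chosen.append(sentence)
    return " ".join(chosen)


# Hypothetical page text, for illustration only.
text = (
    "Protesters filled the square on Tuesday. "
    "Police blocked the main roads. "
    "A third sentence follows."
)
description = lead_summary(text)
```

A stronger summarizer could replace the lead-sentence step while keeping the same three budgets.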
  • 94. @shawnmjones @StormyArchives Generating Document Metadata If no metadata exists, we can select a striking image from the images available in the document Which of the images outlined in red is the striking one chosen by the author? How would a machine know which one to choose if there were no striking image specified in the metadata? 94
  • 95. @shawnmjones @StormyArchives Our generic image selection approach has 3 steps: 1. Score each image in the document by some approach (e.g., ML probability, feature value) 2. Sort the list of images by descending score (e.g., highest ML probability is first, image with most colors is first) 3. Choose the image at the beginning of the list (highest scoring). [Figure: example images sorted by color count (154,131; 48,020; 44,737; 30,940; 3,816 colors) and sorted by classifier probability (0.3623, 0.1948, 0.1259, 0.1116, 0.11), with some images labeled as resized, cropped, or larger.]
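The three steps (score, sort, choose) can be sketched generically. The function name, the image dictionaries, and the URLs are hypothetical; the color counts reuse the example values from the slide, and the scoring function here is the simple "number of colors" feature rather than a trained classifier.

```python
def select_striking_image(images, score):
    """Score each image, sort by descending score, and choose the first."""
    scored = [(score(image), image) for image in images]                # step 1: score
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)     # step 2: sort
    return ranked[0][1] if ranked else None                             # step 3: choose


# Hypothetical images; color counts are the example values from the slide.
images = [
    {"url": "https://example.org/img/banner.png", "colors": 48020},
    {"url": "https://example.org/img/photo.jpg", "colors": 154131},
    {"url": "https://example.org/img/logo.png", "colors": 3816},
    {"url": "https://example.org/img/thumb1.jpg", "colors": 44737},
    {"url": "https://example.org/img/thumb2.jpg", "colors": 30940},
]

chosen = select_striking_image(images, score=lambda image: image["colors"])
```

Because the scoring function is a parameter, the same three steps work whether the score is a feature value or a classifier probability.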
  • 96. @shawnmjones @StormyArchives We visualized how well different approaches performed at choosing a striking image that was perceptually the same as the author’s 96 The best approach starts here As we proceed to the right, we accept more images as perceptually equal to the one selected by the approach All lines converge as any image becomes acceptable as correct Higher scores indicate more accurate answers Remember: we are trying to find the approach that best selects the striking image chosen by the author Generating Document Metadata S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. 2021. Automatically Selecting Striking Images for Social Cards. In ACM WebSci ‘21. https://doi.org/10.1145/3447535.3462505.
  • 97. @shawnmjones @StormyArchives We found that Random Forest performed best with base image features quickly calculated via standard image libraries 97 S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. 2021. Automatically Selecting Striking Images for Social Cards. In ACM WebSci ‘21. https://doi.org/10.1145/3447535.3462505. Generating Document Metadata P@1=0.831 MRR=0.883 base image features: • byte size • width in pixels • height in pixels • negative space (# of histogram cols = 0) • size in pixels • aspect ratio • number of colors
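P@1 and MRR, the two measures reported on this slide, can be computed from ranked image lists as sketched below. This is an illustration under a simplifying assumption: an image counts as correct only if it exactly matches the author's choice, whereas the study compared images perceptually. The function names and toy data are hypothetical.

```python
def precision_at_1(rankings, correct):
    """Fraction of documents whose top-ranked image is the author's image."""
    hits = sum(1 for doc, ranked in rankings.items() if ranked and ranked[0] == correct[doc])
    return hits / len(rankings)


def mean_reciprocal_rank(rankings, correct):
    """Mean of 1/rank of the author's image (0 when it is not in the list)."""
    total = 0.0
    for doc, ranked in rankings.items():
        if correct[doc] in ranked:
            total += 1.0 / (ranked.index(correct[doc]) + 1)
    return total / len(rankings)


# Hypothetical ranked images per document and the author's choice for each.
rankings = {
    "d1": ["a", "b"],
    "d2": ["x", "y", "z"],
    "d3": ["p", "q"],
    "d4": ["m", "n"],
}
correct = {"d1": "a", "d2": "y", "d3": "p", "d4": "n"}

p_at_1 = precision_at_1(rankings, correct)
mrr = mean_reciprocal_rank(rankings, correct)
```

P@1 only rewards a correct first choice, while MRR also gives partial credit when the author's image is ranked second or lower.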
  • 98. @shawnmjones @StormyArchives RQ4: What methods that automate the creation of surrogates produce results that best match humans' behavior? We analyzed the metadata usage of news article mementos over time. Metadata fields associated with cards had astronomical growth. Authors write card descriptions that are 268 characters, 52 words, or 2 sentences long. We can use this length as input to automatic text summarization algorithms, like TextRank. With base image features, Random Forest performed best for choosing the same striking image as the author. S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “Automatically Selecting Striking Images for Social Cards,” In ACM Web Science Conference, 2021. https://doi.org/10.1145/3447535.3462505. S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “It's All About The Cards: Sharing on Social Media Probably Encouraged HTML Metadata Growth,” To be in ACM/IEEE Joint Conference on Digital Libraries, 2021. [preprint: https://arxiv.org/abs/2104.04116.]
  • 99. @shawnmjones @StormyArchives We implemented these results as part of MementoEmbed. As an archive-aware surrogate service, MementoEmbed provides different types of surrogates for mementos: cards, browser thumbnails, imagereels, and word clouds. It also has an extensive API for generating document metadata. S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
  • 100. @shawnmjones @WebSciDL Outline 1. Motivation And Research Questions 2. Background And Related Work 3. Selecting Exemplars And Generating Story Metadata 4. Generating Document Metadata 5. Visualizing And Distributing Stories 6. Contributions And Conclusion 100
  • 101. @shawnmjones @StormyArchives Because Storify was gone, we created Raintale for visualizing and distributing stories 101 Visualizing And Distributing Stories S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137 Storify provided an API, allowing us to configure the look and feel of our story. With this functionality gone, we created Raintale, a platform agnostic storytelling tool that generates files or social media posts.
  • 102. @shawnmjones @WebSciDL Remember, Elbert wants to promote his collections for others, and he uses the DSA Toolkit to do so 102 Today he is promoting a collection about COVID-19. Visualizing And Distributing Stories From this: 23,376 mementos To this: a sample of 36 mementos visualized as social cards, phrases, and images S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
  • 103. @shawnmjones @StormyArchives Elbert applies all processes of our storytelling model 103 Visualizing And Distributing Stories Generate Story Metadata Select Exemplars Generate Document Metadata Visualize The Story Distribute The Story S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
  • 104. @shawnmjones @WebSciDL Remember, Natasha needs to compare collections to each other 104 Today she is reviewing different collections about shootings. Virginia Tech El Paso Norway Visualizing And Distributing Stories S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
  • 105. @shawnmjones @StormyArchives Ling inherited a collection and needs to know what it contains 105 Ling can apply our processes with a different template to include other information, like structural features. Visualizing And Distributing Stories To this: 50 exemplars, structural features, metadata analysis, growth curves, and more From this: 88,755 mementos and no metadata
  • 106. @shawnmjones @WebSciDL Rustam wants to see how a page changed over time 106 Visualizing And Distributing Stories Generate Story Metadata Select Exemplars Generate Document Metadata Visualize The Story Distribute The Story Rustam uses Hypercane to help him choose a page and then view its change over time.
  • 107. @shawnmjones @StormyArchives Rustam chooses one of Raintale’s default templates because he is using the DSA Toolkit for exploration 107 Visualizing And Distributing Stories Rustam’s story seems plain, but he is really interested in the changing text over time.
  • 108. @shawnmjones @WebSciDL Olayinka wants to see what different news sources said on the same day in different years 108 Visualizing And Distributing Stories With our SHARI process, she can compare different years to each other 2018 US Elections 2020 COVID-19 2019 Mass shootings in El Paso and Dayton S. M. Jones, A. C. Nwala, M. C. Weigle, M. Klein, and M. L. Nelson, “SHARI -- An Integration of Tools to Visualize the Story of the Day,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00139
  • 109. @shawnmjones @StormyArchives Olayinka can look through the stories produced by our SHARI process to perform her comparisons 109 Visualizing And Distributing Stories Our process is not just limited to our implementation, and allows us to incorporate input from other systems, like StoryGraph. Generate Document Metadata Visualize The Story Distribute The Story Select Exemplars Generate Story Metadata S. M. Jones, A. C. Nwala, M. C. Weigle, M. Klein, and M. L. Nelson, “SHARI -- An Integration of Tools to Visualize the Story of the Day,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00139
  • 110. @shawnmjones @WebSciDL Outline 1. Motivation And Research Questions 2. Background And Related Work 3. Selecting Exemplars And Generating Story Metadata 4. Generating Document Metadata 5. Visualizing And Distributing Stories 6. Contributions And Conclusion 110
  • 111. @shawnmjones @StormyArchives We presented a model for storytelling with web archives 111 Contributions
  • 112. @shawnmjones @WebSciDL We established a vocabulary for different types and structural features of collections (iPRES 2018). Based on a manual review of 3,382 Archive-It collections, we classified them into 4 types:
      Type | % of Archive-It Collections | Description | Example Collection
      Self-Archiving | 54.1% | an organization archiving itself | University of Utah Web Archive
      Subject-based | 27.6% | seeds bound by single topic | Environmental Justice
      Time Bounded – Expected | 14.1% | an expected event or time period | 2008 Olympics
      Time Bounded – Spontaneous | 4.2% | unexpected event | Tucson Shootings
    Growth curves give us some idea of the curatorial involvement with a collection over time. Structurally, for seeds, we can study the: • distribution of domains • distribution of path depths • most frequent path depth • query string usage. S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P Contributions
  • 113. @shawnmjones @StormyArchives Word count is a fast, effective intra-TimeMap method of identifying off-topic mementos (iPRES 2018). Reasons a memento goes off-topic: technical problems, page gone, hacking, moving on from topic. S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87 Contributions
  • 114. @shawnmjones @WebSciDL We devised a set of primitives for intelligently selecting exemplars from web archive collections 114 Contributions
  • 115. @shawnmjones @StormyArchives Hypercane implements our primitives for selecting exemplars 115 ACM/IEEE JCDL 2021 ACM SIGWEB Newsletter 2021 S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. Hypercane: Intelligent Sampling for Web Archive Collections. In ACM/IEEE Joint Conference on Digital Libraries, 2021. [To be published in 2021] S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. Hypercane: Toolkit for Summarizing Large Collections of Archived Pages. In SIGWEB Newsletter Autumn, 2021. [To be published in 2021] Contributions
  • 116. @shawnmjones @WebSciDL We created four different algorithms from these primitives and found that they produce exemplars with low retrievability with a state-of-the-art search engine 116 We applied four different query methods to the mementos surfaced by these algorithms. As designed, our DSA4 algorithm surfaced more novel exemplars than those discoverable via the search engine. We measured mean retrievability and zero retrievability to determine how easy a document was to retrieve with the given query method. Contributions
  • 117. @shawnmjones @StormyArchives Our user study provides engineers support for choosing social cards over other surrogate types 117 From our user study, correct answers per surrogate indicate that social cards probably outperform the Archive-It surrogate With social cards, users were able to correctly answer our questions without as much interaction. ACM CIKM 2019 S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM International Conference on Information and Knowledge Management, 2019. https://doi.org/10.1145/3357384.3358039. Contributions
  • 118. @shawnmjones @WebSciDL We established methods for generating the metadata for social cards if it does not exist 118 S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “Automatically Selecting Striking Images for Social Cards,” In ACM Web Science Conference, 2021. https://doi.org/10.1145/3447535.3462505. ACM Web Science 2021 For choosing striking images, we trained classifiers using base image features (e.g., pixel size, color count) to choose the same striking image that web page authors chose. Random Forest with these base image features performed best. Contributions
  • 119. @shawnmjones @StormyArchives We explored the reasons for metadata adoption (ACM/IEEE JCDL 2021). Many efforts have been made to encourage metadata adoption by web page authors. Once social card metadata became available, its use skyrocketed! S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “It's All About The Cards: Sharing on Social Media Probably Encouraged HTML Metadata Growth,” To be in ACM/IEEE Joint Conference on Digital Libraries, 2021. [preprint: https://arxiv.org/abs/2104.04116.] Contributions
  • 120. @shawnmjones @WebSciDL We released MementoEmbed and Raintale as reference implementations for visualizing and distributing stories 120 WADL 2020 WADL 2020 We detailed how to generate document metadata with MementoEmbed and visualize and distribute the story with Raintale. We also provided an example of these processes for a day’s news. Contributions S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137 S. M. Jones, A. C. Nwala, M. C. Weigle, M. Klein, and M. L. Nelson, “SHARI -- An Integration of Tools to Visualize the Story of the Day,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00139
  • 121. @shawnmjones @WebSciDL And I am eager to apply this expertise at Los Alamos National Laboratory’s Information Sciences Division (CCS-3) 121 https://oduwsdl.github.io/dsa-puddles/shawnmjones/
  • 123. @shawnmjones @StormyArchives Using our model and the lessons from these research questions, we have implemented tools to tell stories that summarize web archive collections. Processes: Generate Story Metadata, Select Exemplars, Generate Document Metadata, Visualize The Story, Distribute The Story. Contributions: • 5-process model for automatic storytelling • vocabulary for types of web archive collections • structural features of web archive collections • word count works best for identifying off-topic mementos • set of primitives for building algorithms • algorithms built with primitives select novel exemplars that a standard search engine did not discover • social cards provide better understanding than the existing state-of-the-art web archive surrogates • machine learning can select the same striking images as a page author • Hypercane, MementoEmbed, and Raintale as implementations. Read the dissertation for • use cases • more example stories • details on experiments • details on these tools • examples with web archives other than Archive-It. A sample of future work ideas: • better summary evaluation • augmenting collections with live web metadata • entity/topic cards rather than social cards • summarizing scholar output, project status, scatter/gather interfaces • solving corporate intranet search problems. Conclusion https://oduwsdl.github.io/dsa/ What story will you tell with web archives?
  • 125. @shawnmjones @StormyArchives As collection users, we view Archive-It collections from outside… 125 • Curators select seeds, which are captured as seed mementos • Deep mementos are created from other pages linked to seeds S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P Selecting Exemplars and Generating Story Metadata
  • 126. @shawnmjones @StormyArchives Response times per surrogate had interesting means, but the differences were not statistically significant at p < 0.05. [Figure: Response Times Per Surrogate, showing median and mean for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t; reported p-values: p = 0.190 and p = 0.202] S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
  • 127. @shawnmjones @StormyArchives The Off-Topic Memento Toolkit (OTMT) compares a seed’s first memento with the seed’s other mementos via different measures:
      Measure | Fully Equivalent Score | Fully Dissimilar Score | Preprocessing Performed | OTMT -tm keyword
      Byte Count | 0.0 | -1.0 | No | bytecount
      Word Count | 0.0 | -1.0 | Yes | wordcount
      Jaccard Distance | 0.0 | 1.0 | Yes | jaccard
      Sørensen-Dice | 0.0 | 1.0 | Yes | sorensen
      Simhash of Term Frequencies | 0 | 64 | Yes | simhash-tf
      Simhash of Raw Memento | 0 | 64 | No | simhash-raw
      Cosine Similarity of TF-IDF Vectors | 1.0 | 0 | Yes | cosine
      Cosine Similarity of LSI Vectors | 1.0 | 0 | Yes | gensim_lsi
    S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
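    The two set-based measures in this table can be sketched in a few lines. This is a minimal illustration, not the OTMT's own code: the tokenizer, the function names, and the `threshold` parameter are all assumptions made for the example. Both distances return 0.0 for identical token sets and 1.0 for token sets with nothing in common, matching the score ranges listed above.

    ```python
    import re


    def tokens(text):
        """Hypothetical preprocessing: lowercase and keep alphanumeric tokens."""
        return set(re.findall(r"[a-z0-9]+", text.lower()))


    def jaccard_distance(first, other):
        """1 - |A ∩ B| / |A ∪ B| over the token sets of two documents."""
        a, b = tokens(first), tokens(other)
        if not a and not b:
            return 0.0  # two empty documents are treated as equivalent
        return 1.0 - len(a & b) / len(a | b)


    def sorensen_dice_distance(first, other):
        """1 - 2|A ∩ B| / (|A| + |B|) over the token sets of two documents."""
        a, b = tokens(first), tokens(other)
        if not a and not b:
            return 0.0
        return 1.0 - 2 * len(a & b) / (len(a) + len(b))


    def flag_off_topic(first_memento, later_mementos, measure, threshold):
        """Return indexes of later mementos whose distance from the first
        memento exceeds a caller-supplied (hypothetical) threshold."""
        return [
            i
            for i, text in enumerate(later_mementos)
            if measure(first_memento, text) > threshold
        ]
    ```

    For example, `flag_off_topic(first, others, jaccard_distance, 0.9)` would flag only mementos that share almost no vocabulary with the seed's first memento; a suitable threshold has to be chosen per measure.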
  • 128. @shawnmjones @StormyArchives Does most of the collection exist earlier or later in its life? 128 This collection was created in March 2010. Most of its mementos come from 2016 – 2018. Most of this collection exists later in its life. Structural feature discussed here: • area under the seed memento growth curve • lifespan of the collection S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  • 129. @shawnmjones @StormyArchives When did the curator select and archive a collection’s contents? 129 This collection was created in March 2006. Some of the seeds were selected in 2006. Many of the seeds were selected all along its life. It has mementos as recent as July 2018. Structural feature discussed here: • area under the seed growth curve • lifespan of the collection S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
  • 130. @shawnmjones @StormyArchives Did the curator create a collection intended to archive new versions of the same web pages repeatedly? 130 This collection was created in June 2014. The seeds were selected toward the beginning of its life. Mementos were captured all during its life. Structural feature discussed here: • area under the seed growth curve • area under the seed memento growth curve • lifespan of the collection S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
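    The "area under the growth curve" features on the last three slides can be illustrated with a short sketch. It assumes, as a simplification that may differ from the cited paper, that both axes are normalized to [0, 1] (fraction of the collection's lifespan elapsed versus fraction of seeds or seed mementos added) and that the curve is a step function. The function name and its arguments are hypothetical.

    ```python
    from datetime import datetime


    def growth_curve_area(event_times, start, end):
        """Area under a normalized cumulative growth curve.

        event_times: datetimes at which seeds (or seed mementos) were added,
                     assumed to fall between start and end.
        start, end:  first and last datetime of the collection's lifespan.

        For a step curve, each event contributes the fraction of the lifespan
        that remains after it, so the area is the mean of those fractions.
        Values near 1.0 mean growth happened early; values near 0.0 mean late.
        """
        if not event_times:
            raise ValueError("need at least one event")
        lifespan = end - start
        if lifespan.total_seconds() <= 0:
            raise ValueError("end must be after start")
        remaining = [(end - t) / lifespan for t in event_times]
        return sum(remaining) / len(remaining)
    ```

    Under these assumptions, a collection whose seeds were all selected at creation scores 1.0 on the seed curve, while one whose mementos mostly arrive in its final years scores close to 0.0 on the seed memento curve.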
  • 131. @shawnmjones @StormyArchives The Memento Protocol provides us a standard method for acquiring information from web archives. Background and Related Work. Memento gives us TimeGates, identified by URI-G, for finding a specific memento based on its original resource and capture datetime, its memento-datetime. Memento also gives us TimeMaps, identified by URI-T, for listing all of the mementos for an original resource and their memento-datetimes. Example TimeMap (the slide labels its URI-R, URI-T, URI-G, URI-M, and memento-datetime values):
      <http://a.example.org>;rel="original",
      <http://arxiv.example.net/timemap/http://a.example.org>; rel="self"; type="application/link-format"; from="Tue, 20 Jun 2000 18:02:59 GMT"; until="Wed, 21 Jun 2000 04:41:56 GMT",
      <http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
      <http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT",
      <http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento"; datetime="Tue, 27 Oct 2009 20:49:54 GMT",
      <http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento"; datetime="Wed, 21 Jun 2000 01:17:31 GMT",
      <http://arxiv.example.net/web/20000621044156/http://a.example.org>; rel="memento"; datetime="Wed, 21 Jun 2000 04:41:56 GMT"
      ...
    Van de Sompel, H., Nelson, M., & Sanderson, R. “RFC 7089 – HTTP Framework for Time-Based Access to Resource States -- Memento”. http://www.rfc-editor.org/info/rfc7089. 2013.
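    To show how a client might consume a TimeMap like the one on this slide, here is a minimal sketch that parses the link-format text and lists the mementos in memento-datetime order. It is an illustration only: the regular expressions handle the simple quoted-attribute form used in this example rather than every case the link-format specification allows, and the names `parse_timemap` and `list_mementos` are hypothetical.

    ```python
    import re
    from email.utils import parsedate_to_datetime

    # The TimeMap entries from the slide, as a link-format string.
    TIMEMAP = """<http://a.example.org>;rel="original",
    <http://arxiv.example.net/timemap/http://a.example.org>; rel="self"; type="application/link-format"; from="Tue, 20 Jun 2000 18:02:59 GMT"; until="Wed, 21 Jun 2000 04:41:56 GMT",
    <http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
    <http://arxiv.example.net/web/20000620180259/http://a.example.org>; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT",
    <http://arxiv.example.net/web/20091027204954/http://a.example.org>; rel="last memento"; datetime="Tue, 27 Oct 2009 20:49:54 GMT",
    <http://arxiv.example.net/web/20000621011731/http://a.example.org>; rel="memento"; datetime="Wed, 21 Jun 2000 01:17:31 GMT",
    <http://arxiv.example.net/web/20000621044156/http://a.example.org>; rel="memento"; datetime="Wed, 21 Jun 2000 04:41:56 GMT"
    """

    # One entry: <URI> followed by zero or more ;name="value" attributes.
    LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
    ATTRIBUTE = re.compile(r'([\w-]+)="([^"]*)"')


    def parse_timemap(text):
        """Return a list of {"uri", "rel", "attrs"} dicts, one per entry."""
        entries = []
        for uri, raw_attrs in LINK.findall(text):
            attrs = dict(ATTRIBUTE.findall(raw_attrs))
            # rel may hold several space-separated values, e.g. "first memento"
            entries.append({"uri": uri, "rel": attrs.get("rel", "").split(), "attrs": attrs})
        return entries


    def list_mementos(entries):
        """Return (memento-datetime, URI-M) pairs sorted by memento-datetime."""
        pairs = [
            (parsedate_to_datetime(e["attrs"]["datetime"]), e["uri"])
            for e in entries
            if "memento" in e["rel"]
        ]
        return sorted(pairs)


    if __name__ == "__main__":
        for when, urim in list_mementos(parse_timemap(TIMEMAP)):
            print(when.isoformat(), urim)
    ```

    In practice the TimeMap text would come from an HTTP GET on the URI-T; the string is embedded here so the sketch runs without network access.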
  • 132. @shawnmjones @StormyArchives We use surrogates all of the time! • Browser Thumbnail (example from UK Web Archive) • Text snippet (example from Bing) • Social Card (example from Facebook) • Text + Thumbnail (example from Internet Archive). S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018. Motivation and Research Questions
  • 133. @shawnmjones @WebSciDL Surrogates are not new! Traditional surrogates contain metadata generated by humans to convey aboutness 133 An individual surrogate summarizes an item. Card catalogs, however, were not stories, just manual methods for finding individual items in collections. Motivation and Research Questions
  • 134. @shawnmjones @StormyArchives Surrogates provide a visual summary of the content behind a URI… Long URI: https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582 The slide shows the same URI as a browser thumbnail surrogate and as a social card surrogate. Background and Related Work
  • 135. @shawnmjones @WebSciDL Social media storytelling uses surrogates to provide a “summary of summaries” 135 2 resources are shown from this Wakelet story 6 resources are shown from this Storify story Each surrogate summarizes a web resource. Each story groups the surrogates, summarizing the topic. We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm.
  • 136. @shawnmjones @StormyArchives The Problem: Understanding web archive collections is costly • There are multiple collections about the “same concept.” • The metadata for each collection is non-existent, or inconsistently applied (more seeds = less metadata). • A seed is a web page to be crawled. • A memento is an observation of a seed at a specific point in time. • Many collections have 1000s of seeds with multiple mementos. • There are more than 14,000 collections. • Archive-It is a popular platform, but other web archive collection platforms exist (e.g., Library of Congress, Conifer, Trove). • Existing solutions do not handle the time dimension inherent to web archive collections.
  • 137. @shawnmjones @StormyArchives 137 Our Solution: Social media storytelling uses groups of surrogates to provide a “summary of summaries” Each surrogate summarizes a web resource. Each story groups the surrogates, summarizing the topic. We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm. We established a five-process model for storytelling with web archive collections A surrogate summarizes a web page. This surrogate type is called a social card. Storytelling is the visualization. Our contribution is the automation that selects the exemplars and metadata that make this story.
  • 138. @shawnmjones @StormyArchives The problem, summarized • There are multiple collections about the same concept. • The metadata for each collection is non-existent, or inconsistently applied. • Many collections have 1000s of seeds with multiple mementos. • There are more than 14,000 collections. • Human review of these mementos for collection understanding is an expensive proposition.
  • 139. @shawnmjones @StormyArchives Archive-It allows easy collection creation Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive collections. Curators can supply live web resources as seeds and establish crawling schedules of those seeds to create mementos. 139
  • 140. @shawnmjones @StormyArchives Reviewing mementos manually is costly. This collection has 132,599 seeds, many with multiple mementos. • Some collections have 1000s of seeds. • Each seed can have many mementos. • In some cases, this can require reviewing 100,000+ documents to understand the collection.
  • 141. @shawnmjones @StormyArchives More Archive-It collections are added every year. More than 14,000 collections exist as of the end of 2020. [Figure: # of New Archive-It Collections Per Year, 2005 to 2020; x-axis: Year; y-axis: # of Collections; series: All Collections, Only Private Collections, Only Public Collections]
  • 142. @shawnmjones @StormyArchives Latent Semantic Analysis for document clustering. LSA utilizes a term-document matrix • rows correspond to terms and columns correspond to documents • elements are typically weighted via TF-IDF • if TF-IDF, then each element is proportional to the number of times the term appears in the document • use singular value decomposition to factor the matrix into new matrices • the last of these matrices contains a set of documents with coordinates for each cluster. LSA requires that the user supply the desired number of topics; it will be difficult to generalize this number across types of collections. Dark cells indicate high weights. High weights signify clustering. Wikipedia contributors. (2019, July 26). Latent semantic analysis. In Wikipedia, The Free Encyclopedia. Retrieved 21:31, July 31, 2019, from https://en.wikipedia.org/w/index.php?title=Latent_semantic_analysis&oldid=907976703
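    For reference, the decomposition step can be written out as the standard truncated singular value decomposition. The symbols below are notation chosen for this note, not taken from the slide: $X$ is the TF-IDF-weighted term-document matrix with $t$ terms and $d$ documents, and $k$ is the user-supplied number of topics.

    ```latex
    X \approx U_k \, \Sigma_k \, V_k^{\mathsf{T}},
    \qquad
    X \in \mathbb{R}^{t \times d},\quad
    U_k \in \mathbb{R}^{t \times k},\quad
    \Sigma_k \in \mathbb{R}^{k \times k},\quad
    V_k^{\mathsf{T}} \in \mathbb{R}^{k \times d}
    ```

    Each column of $V_k^{\mathsf{T}}$ gives one document's coordinates across the $k$ topics, which is what the clustering step reads. Because $k$ fixes the shape of every factor, it must be chosen before the decomposition is computed.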
  • 143. @shawnmjones @StormyArchives Latent Dirichlet Allocation. For a corpus $D$ consisting of $M$ documents, each of length $N_i$:
      1. Choose $\theta_i \sim \mathrm{Dir}(\alpha)$, where $i \in \{1, \dots, M\}$ and $\mathrm{Dir}(\alpha)$ is a Dirichlet distribution with symmetric parameter $\alpha$, which typically is sparse ($\alpha < 1$)
      2. Choose $\varphi_k \sim \mathrm{Dir}(\beta)$, where $k \in \{1, \dots, K\}$ and $\beta$ typically is sparse
      3. For each of the word positions $i, j$, where $i \in \{1, \dots, M\}$ and $j \in \{1, \dots, N_i\}$:
         1. Choose a topic $z_{i,j} \sim \mathrm{Multinomial}(\theta_i)$*
         2. Choose a word $w_{i,j} \sim \mathrm{Multinomial}(\varphi_{z_{i,j}})$
      • $K$ is the number of topics requested by the user • $M$ is the number of documents in the corpus • $N_i$ is the number of words in document $i$ • $\varphi_k$ is the word distribution for topic $k$ • $\theta_i$ is the topic distribution for document $i$ • $z_{i,j}$ is the topic for the $j$-th word in document $i$ • $w_{i,j}$ is a specific word in document $i$
      * Example of a multinomial: the probability of the counts of each side when rolling a k-sided die n times.
      It will be difficult to generalize this number ($K$) across types of collections.
    Wikipedia contributors. (2019, July 25). Latent Dirichlet allocation. In Wikipedia, The Free Encyclopedia. Retrieved 20:13, July 31, 2019, from https://en.wikipedia.org/w/index.php?title=Latent_Dirichlet_allocation&oldid=907806560
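    The generative process that LDA assumes can be sketched with the Python standard library alone. This is an illustration of the sampling steps, not a topic-model fitting routine. The function names, the toy vocabulary in the usage note, and the choice to draw Dirichlet samples by normalizing gamma variates are all assumptions made for the example.

    ```python
    import random


    def sample_dirichlet(concentration, size, rng):
        """Draw from a symmetric Dirichlet by normalizing gamma variates."""
        draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
        total = sum(draws)
        return [d / total for d in draws]


    def generate_corpus(vocabulary, num_topics, doc_lengths, alpha, beta, seed=0):
        """Generate documents as lists of (topic index, word) pairs.

        num_topics  is K, the number of topics requested by the user.
        doc_lengths holds N_i for each of the M documents.
        alpha, beta are the symmetric Dirichlet parameters.
        """
        rng = random.Random(seed)
        # Step 1: a topic distribution theta_i for every document i
        thetas = [sample_dirichlet(alpha, num_topics, rng) for _ in doc_lengths]
        # Step 2: a word distribution phi_k for every topic k
        phis = [sample_dirichlet(beta, len(vocabulary), rng) for _ in range(num_topics)]
        # Step 3: for each word position, choose a topic, then a word
        corpus = []
        for theta_i, length in zip(thetas, doc_lengths):
            document = []
            for _ in range(length):
                z = rng.choices(range(num_topics), weights=theta_i)[0]
                w = rng.choices(vocabulary, weights=phis[z])[0]
                document.append((z, w))
            corpus.append(document)
        return corpus
    ```

    Calling `generate_corpus(["seed", "memento", "crawl", "story"], num_topics=2, doc_lengths=[5, 3], alpha=0.5, beta=0.5)` yields two documents of five and three words. Fitting LDA runs this process in reverse, inferring the distributions from observed words, and still needs `num_topics` up front.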
  • 144. @shawnmjones @WebSciDL Many have tackled selecting exemplar sentences or images from a document; few have covered selecting exemplar documents from a corpus over time. Background and Related Work. We are inspired by these solutions and will apply some of their ideas in a moment. Examples shown: Silva et al. word graphs; Sipos et al. influential author clusters. Silva and Sampaio. 2014. Using Luhn’s Automatic Abstract Method to Create Graphs of Words for Document Visualization. Social Networking. 65-70. https://doi.org/10.4236/sn.2014.32008. R. Sipos et al. 2012. Temporal corpus summarization using submodular word coverage. In ACM CIKM 2012, 754-763. https://doi.org/10.1145/2396761.2396857.
  • 145. @shawnmjones @StormyArchives Existing tools for web archive collections require that the user have access to WARCs. 145 ArchiveSpark Archives Unleashed Cloud (now part of Archive-It) Archivists are the only ones likely to have that access. We want anyone to be able to summarize a collection. Warclight Background and Related Work Holzmann et al. 2016. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. In ACM/IEEE JCDL 2016, 83-92. https://doi.org/10.1145/2910896.2910902. Ruest et al. 2014. archivesunleashed/warclight – A Rails engine supporting the discovery of web archives. https://github.com/archivesunleashed/warclight. Deschamps et al. 2019. The Cost of a WARC: Analyzing Web Archives in the Cloud. In ACM/IEEE JCDL 2019, 261-264. https://doi.org/10.1109/JCDL.2019.00043. Stories also need URIs for linking surrogates. WARCs alone cannot do this.
  • 146. @shawnmjones @StormyArchives Existing work on generating story metadata relies on archivists to manually review and annotate each seed or memento 146 Scale is the greatest challenge here. Web archive collections grow quickly, and archivists have a hard time keeping up with the number of documents to annotate. Background and Related Work D. V. Pitti, “Encoded Archival Description,” D-Lib Magazine, vol. 5, no. 11, 1999. https://doi.org/10.1045/november99-pitti. Encoded Archival Description could work, if there were not thousands of documents to annotate.
  • 147. @shawnmjones @StormyArchives Other studies on surrogates did not focus on whether participants understood the underlying collection, but on whether participants chose the correct search result for a query. These studies did not compare thumbnails to social cards directly. Web archives love using thumbnails, but is there something better for visitors? Background and Related Work
  • 148. @shawnmjones @StormyArchives Others tried to visualize whole collections at once or created solutions specific to a web archive. Examples shown: Conta Me Histórias; Padia et al. R. Campos et al. 2021. Automatic generation of timelines for past-web events. The Past Web: Exploring Web Archives, 225-242. https://doi.org/10.1007/978-3-030-63291-5_18. K. Padia, Y. AlNoamany, and M. C. Weigle, “Visualizing digital collections at Archive-It,” in Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries, (Washington, DC, USA), pp. 15–18, 2012. https://doi.org/10.1145/2232817.2232821. Background and Related Work
  • 149. @shawnmjones @StormyArchives Web surrogates provide a visual summary of the content behind a URI… Long URI: https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582 The slide shows the same URI as a browser thumbnail surrogate and as a social card surrogate.
  • 150. @shawnmjones @StormyArchives Social media storytelling uses surrogates to provide a “summary of summaries” 150 2 resources are shown in this Wakelet story 6 resources are shown in this Storify story Each surrogate summarizes a web resource. Each story groups the surrogates, summarizing the topic. We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm.