Collections are the tools that people use to make sense of an ever-increasing number of archived web pages. As collections themselves grow, we need tools to make sense of them. Tools that work on the general web, like search engines, are not a good fit for these collections because search engines do not currently represent multiple document versions well. Web archive collections themselves are vast, some containing hundreds of thousands of documents. There are also thousands of collections, many of which cover the same topic. Few collections include standardized metadata. Too many documents from too many collections with not enough metadata makes collection understanding an expensive proposition.
This dissertation establishes a five-process model to assist with web archive collection understanding. This model aims to automatically produce a social media story -- a visualization paradigm with which most web users are already familiar. Each social media story contains surrogates which are summaries of individual documents. These surrogates, when collected together, summarize the overall topic of the story. After applying our storytelling model, they summarize the topic of a web archive collection.
We develop and test a framework of algorithmic primitives for selecting the exemplars that best represent a collection. We establish that algorithms built from these primitives select exemplars that are otherwise undiscoverable using conventional search engine methods. We generate story metadata to improve the information scent of a story so users can understand it better. After an analysis showing that existing platforms perform poorly for web archives and a user study establishing the best surrogate type, we generate document metadata for the exemplars with machine learning. We then visualize the story and document metadata together and distribute it to satisfy the information needs of multiple personas who benefit from our model.
Our tools serve as a reference implementation of our Dark and Stormy Archives storytelling model. Hypercane selects exemplars and generates story metadata. MementoEmbed generates document metadata. Raintale visualizes and distributes the story based on the story metadata and the document metadata of these exemplars. By providing understanding at a glance, our stories save users the time and effort of reading thousands of documents and, most importantly, help them understand web archive collections.
Improving Collection Understanding For Web Archives With Storytelling: Shining Light Into Dark and Stormy Archives - PhD Defense
1. Improving Collection Understanding
For Web Archives With Storytelling:
Shining Light Into
Dark and Stormy Archives
Shawn M. Jones
Los Alamos National Laboratory
Research Library Prototyping Team
Web Science and Digital Libraries Research Group
Old Dominion University
Dissertation Defense 2021/08/05
Thanks to:
2. @shawnmjones @WebSciDL
Outline
1. Motivation And Research
Questions
2. Background And Related Work
3. Selecting Exemplars And
Generating Story Metadata
4. Generating Document Metadata
5. Visualizing And Distributing Stories
6. Contributions And Conclusion
3. @shawnmjones @StormyArchives
November 8, 2019
During the second week of November 2019,
the National Center for Medical Intelligence shared
intelligence based on "monitoring of internal Chinese
communications" that warned of a potential novel
coronavirus pandemic coming out of Wuhan.
Source: https://en.wikipedia.org/wiki/Timeline_of_the_COVID-19_pandemic_in_2019
COVID-19 was not named and was only
known to a small group in the US.
No news coverage existed.
4. @shawnmjones @StormyArchives
December 16, 2019
The first documented COVID-19 hospital
admission was on December 16, 2019.
COVID-19 was still not well known and
received no news coverage.
7. @shawnmjones @StormyArchives
March 13, 2020
A month later, CNN had many front-page
articles about coronavirus with a special
Coronavirus heading for more articles.
10. @shawnmjones @StormyArchives
A web archive
helped me tell
this story.
These mementos are stored
in the Internet Archive.
They are full captures of the
web code that existed on
those dates.
12. @shawnmjones @StormyArchives
Natasha is studying how disasters shape
cultures...
Sources like Wikipedia now have a
summary of the event after the
fact.
Today she is
reviewing the
South Louisiana
Flood of 2016.
Motivation and Research Questions
She wants to know about
the news reporting as it was
at the time of the event.
13. @shawnmjones @StormyArchives
Per Nwala et al., news articles about the event tend to slide
down search results as we get further from the event.
Green = coverage of event
Red = Summaries of the event
A. C. Nwala, M. C. Weigle, and M. L. Nelson, “Scraping SERPs for Archival Seeds: It Matters
When You Start,” in ACM/IEEE JCDL, 2018. https://doi.org/10.1145/3197026.3197056.
She knows that
five years later,
it is harder to
find news
articles from
the event itself.
14. @shawnmjones @StormyArchives
Natasha also knows that news articles are updated with
more current and correct information
She wants to
know about
the news
reporting as it
was at the time
of the event.
Today
8/14/2016
during event
15. @shawnmjones @StormyArchives
Natasha knows that any time that we need proof
that X said Y at date D, we need web archives
She knows that
web archives
contain not just
“screenshots”
but full
captures of
web code as
mementos.
To start, she must know a
URL and capture
datetime.
Then she can view a
memento.
And she can review its
code, if needed.
17. @shawnmjones @StormyArchives
With these themed collections, she can discover documents
that once existed and match her event or topic
Virginia Tech: Crisis, Tragedy, and
Recovery Network capturing
coverage of the 2011 Tucson Shootings
University of Utah capturing its
web presence over time
18. @shawnmjones @StormyArchives
Natasha has discovered multiple sites with
themed web archive collections
Library of Congress
Archive-It
(by the Internet Archive)
Trove
Conifer
Each site has
different
capabilities and
different types of
collections.
19. @shawnmjones @StormyArchives
Natasha chooses to look through the
themed collections at Archive-It
As a popular subscription service of the Internet
Archive, Archive-It helps archivists create themed
collections.
These collections consist of seeds.
Mementos are observations of a seed at different
points in time.
For each seed, there are multiple mementos.
This seed has 7 mementos (captured 7 times).
20. @shawnmjones @StormyArchives
There are multiple collections about the
subject. Which one should she work with?
This is not the only
disaster she is studying.
She needs to waste as
little time as possible.
21. @shawnmjones @StormyArchives
Natasha is not alone: 44
Archive-It collections
match the search query
“human rights”
How are they different
from each other?
Which one is best for
her needs?
22. @shawnmjones @StormyArchives
Rustam needs to study how the Boston
Marathon Bombing unfolded…
Reviewing different
mementos of the
same seed allows
Rustam to
understand when
the public learned of
different events,
including when
misinformation was
corrected.
Rather than digging through collections manually, how can Rustam discover and view this more quickly?
23. @shawnmjones @StormyArchives
Olayinka wants to understand what different
news sources revealed on the same day…
Today she is
trying to
understand the
different
reporting on the
September 11th
Attacks.
How can Olayinka discover and view this more quickly?
24. @shawnmjones @StormyArchives
Elbert is an archivist who wants to promote his
collections, so others are aware of them…
He wants to help
visitors like
Natasha, Rustam,
and Olayinka
notice his
collections and use
them.
How does he create enticing visualizations that people can understand with minimal effort?
25. @shawnmjones @StormyArchives
Ling is an archivist who inherited a collection from another
archivist, and she needs to understand it so she can make
decisions about it…
Her collection has
hundreds of
thousands of seeds.
Her predecessor did
not provide much
metadata with the
collection.
Archivists can add metadata to
collections, but many Archive-It
collections contain little metadata.
The more metadata a reader needs to
understand a collection, the less they
have available.
26. @shawnmjones @StormyArchives
Ling knows she is not alone – the collections are often built
automatically, making it difficult to know what they contain
Web Archiving Technical Lead of the British Library
Ling knows that the
automation makes it
expensive to add
metadata to
thousands of
documents after
they are collected.
27. @shawnmjones @StormyArchives
All these personas need a faster method of
collection understanding
Persona    Role        Information need                                        Understanding needs
Natasha    Visitor     Quickly compare collections                             Overall collection
Rustam     Visitor     Follow a source over time                               Aspect (Page) of a collection
Olayinka   Visitor     Understand a time from different sources                Aspect (Time) of a collection
Elbert     Archivist   Promote collections and help visitors understand them   Overall collection
Ling       Archivist   Understand a collection that they inherited             Overall collection
28. @shawnmjones @StormyArchives
All are faced with more than 14,000
collections at Archive-It alone
More than 14,000 collections exist as of the end of 2020
[Chart: # of New Archive-It Collections Per Year, 2005 to 2020. x-axis: Year; y-axis: # of Collections (0 to 2500). Series: All Collections, Only Private Collections, Only Public Collections.]
29. @shawnmjones @StormyArchives
The problem, summarized
• There are multiple collections about the same concept.
• It is difficult to expose aspects (e.g., time, page) of collections.
• The metadata for each collection is nonexistent or inconsistently applied.
• Many collections have 1000s of seeds with multiple mementos.
• There are more than 14,000 collections.
• Human review of these mementos for collection understanding is an expensive proposition.
30. @shawnmjones @StormyArchives
Our proposal: a visualization made of
exemplar mementos
• Our visualization is a summary that will act like an abstract
• Pirolli and Card’s Information Foraging Theory:
  • maximize the value of the information gained from our summaries
  • minimize the cost of interacting with the collection
• ensure that our exemplar mementos have good information scent
  • contain cues that the memento will address a user’s needs
From this:
318 seeds with
2421 mementos
To something like this:
a social media story
of ~28 surrogates
P. Pirolli. 2005. Rational Analyses of Information Foraging on the Web. Cognitive
Science 29, 3 (May 2005), 343–373. DOI:10.1207/s15516709cog0000_20
31. @shawnmjones @StormyArchives
Users already interact with pages like this
every day
A story on Wakelet about the 2021
Capitol Attack
A Twitter Moment of
astronaut Michael Collins
Twitter creates
Moments that
present surrogates
linking to content
about a topic of
interest.
Educators, librarians,
and others create
stories on Wakelet
about different
subjects.
32. @shawnmjones @StormyArchives
Social media stories apply visualizations that
users already know how to understand
An individual surrogate summarizes a web resource.
When we combine surrogates into a story, we
summarize a topic.
33. @shawnmjones @StormyArchives
We developed a five-process storytelling model based
on existing work on summarization and storytelling
[Diagram: the five processes: Select Exemplars, Generate Story Metadata, Generate Document Metadata, Visualize The Story, Distribute The Story. Selecting exemplars yields exemplar mementos. Example story metadata: collection title: 2013 Boston Marathon Bombing; collected by: Internet Archive Global Events; collection URL; image data...; seed data...; top terms; top entities... Example document metadata: title: Boston Marathon Explosions...; description: “The grace this tragedy exposed...”; striking image..]
AlNoamany found
that popular stories
contain 28 elements,
so we have a target
of 28 exemplars.
AlNoamany
pioneered this work
combining web
archive collections
with Storify, but Storify
is now gone.
34. @shawnmjones @StormyArchives
Our five-process storytelling model maps to
our research questions
RQ1: What types of web archive
collections exist and what are
their structural features?
RQ2: What approaches work
best for selecting exemplars
from web archive collections?
RQ3: What surrogates work best
for understanding groups of
mementos?
RQ4: What methods that
automate the creation of
surrogates produce results that
best match humans’ behavior?
[Diagram: the five processes: Select Exemplars, Generate Story Metadata, Generate Document Metadata, Visualize The Story, Distribute The Story]
Examples and Use Cases for our
Personas
35. @shawnmjones @StormyArchives
Our Dark and Stormy Archives Tools serve as a
reference implementation of our storytelling process
36. @shawnmjones @WebSciDL
Outline
1. Motivation And Research
Questions
2. Background And Related Work
3. Selecting Exemplars And
Generating Story Metadata
4. Generating Document Metadata
5. Visualizing And Distributing Stories
6. Contributions And Conclusion
37. @shawnmjones @StormyArchives
URIs identify resources
T. Berners-Lee, et al. “RFC 3986 – Uniform Resource Identifier (URI): Generic Syntax”.
https://www.rfc-editor.org/rfc/rfc3986.txt, 2005.
Jacobs, I. and Walsh, N. eds., “Architecture of the World Wide Web, Vol. 1.”
https://www.w3.org/TR/webarch/, 2003.
URIs are a superset of identifiers that
contains URLs, URNs, etc.
Background and Related Work
URIs identify resources, which have
different representations
depending on the visitor’s needs.
38. @shawnmjones @StormyArchives
HTML is the file format we use for web
resources
HTML contains links to other
pages, identified by URIs.
39. @shawnmjones @StormyArchives
Web archives apply crawlers to quickly visit
pages and follow links to build collections
Crawlers save web resources in the
WARC file format.
WARC/1.0
WARC-Date: 2016-04-01T18:08:53Z
WARC-Type: response
WARC-Record-ID: <urn:uuid:8ede0256-790b-4378-a469-cfdcf0b9f8a6>
WARC-Target-URI: http://philadelphia.feb.gov/
WARC-Payload-Digest: sha1:USWZWZNSVCOSXA2MBSHMRAVZY7R2ZQSY
WARC-Block-Digest: sha1:NNBOIVAWIP63IESE3JCE36X6BRE2YYZG
Content-Type: application/http; msgtype=response
Content-Length: 19556
HTTP/1.0 200 OK
server: Apache-Coyote/1.1
content-type: text/html;charset=utf-8
content-length: 35917
date: Sat, 08 May 2021 16:04:56 GMT
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
The page to be crawled is a seed or original resource.
An observation of that original resource at a specific time is a
memento.
We use the term URI-M to denote a memento URI.
The datetime of a memento’s capture is its memento-datetime.
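The vocabulary above can be connected to the record itself with a short Python sketch. It is an illustration under assumptions, not part of the Dark and Stormy Archives tools: `parse_warc_headers` is a hypothetical helper, the sample is a trimmed copy of the record shown on the slide, the blank line that separates the WARC headers from the HTTP response is added the way the WARC format places it, and reading `WARC-Date` as the capture datetime is an assumption of the sketch.

```python
# Minimal sketch: read the named header fields of a single WARC record.
# parse_warc_headers is a hypothetical helper, not a real library call.

SAMPLE_RECORD = """WARC/1.0
WARC-Date: 2016-04-01T18:08:53Z
WARC-Type: response
WARC-Record-ID: <urn:uuid:8ede0256-790b-4378-a469-cfdcf0b9f8a6>
WARC-Target-URI: http://philadelphia.feb.gov/
Content-Type: application/http; msgtype=response
Content-Length: 19556

HTTP/1.0 200 OK
content-type: text/html;charset=utf-8
"""


def parse_warc_headers(record_text):
    """Return (version, headers) for the WARC header block of one record."""
    lines = record_text.splitlines()
    version = lines[0].strip()
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the WARC header block
        name, _, value = line.partition(":")  # split on the first colon only
        headers[name.strip()] = value.strip()
    return version, headers


version, headers = parse_warc_headers(SAMPLE_RECORD)
seed = headers["WARC-Target-URI"]        # the original resource that was crawled
capture_datetime = headers["WARC-Date"]  # assumed here to be the capture datetime
print(version, seed, capture_datetime)
```

A production crawler or replay system would use a full WARC library rather than string splitting; the sketch only shows where the seed URI and capture datetime live in a record.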
43. @shawnmjones @StormyArchives
A TimeMap gives us a listing of the
mementos available for an original resource
the original resource
“now”
<http://www.cs.odu.edu>;rel="original",
<https://web.archive.org/web/19970102130137/http://cs.odu.edu:80/>;rel="memento"; datetime="Thu, 02 Jan 1997 13:01:37 GMT",
<https://web.archive.org/web/19970606105039/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Fri, 06 Jun 1997 10:50:39 GMT",
<http://archive.md/19970606105039/http://www.cs.odu.edu/>;rel="memento"; datetime="Fri, 06 Jun 1997 10:50:39 GMT",
<https://web.archive.org/web/19971010201632/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Fri, 10 Oct 1997 20:16:32 GMT",
<https://web.archive.org/web/19971211124211/http://www.cs.odu.edu:80/>;rel="memento"; datetime="Thu, 11 Dec 1997 12:42:11 GMT",
...
<https://web.archive.org/web/19990502033600/http://cs.odu.edu:80/>;rel="memento"; datetime="Sun, 02 May 1999 03:36:00 GMT",
...
<https://arquivo.pt/wayback/20091223043049mp_/http://www.cs.odu.edu/>;rel="memento"; datetime="Wed, 23 Dec 2009 04:30:49 GMT",
...
memento from 1997
memento from 1999 memento from 2009
Van de Sompel, H. Nelson, M. & Sanderson, R. “RFC 7089 – HTTP Framework for Time-Based
Access to Resource States -- Memento”. http://www.rfc-editor.org/info/rfc7089. 2013.
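A TimeMap in this link format can be read with a few lines of Python. The sketch below is a simplified illustration, not a complete RFC 7089 parser: `parse_timemap` is a hypothetical helper, the sample contains three of the entries shown on the slide, and the regular expressions assume that every attribute value is double-quoted.

```python
import re
from email.utils import parsedate_to_datetime

# Three entries copied from the TimeMap on the slide.
SAMPLE_TIMEMAP = (
    '<http://www.cs.odu.edu>;rel="original",\n'
    '<https://web.archive.org/web/19970102130137/http://cs.odu.edu:80/>;'
    'rel="memento"; datetime="Thu, 02 Jan 1997 13:01:37 GMT",\n'
    '<https://web.archive.org/web/19990502033600/http://cs.odu.edu:80/>;'
    'rel="memento"; datetime="Sun, 02 May 1999 03:36:00 GMT",\n'
    '<https://arquivo.pt/wayback/20091223043049mp_/http://www.cs.odu.edu/>;'
    'rel="memento"; datetime="Wed, 23 Dec 2009 04:30:49 GMT",\n'
)

# One link: a URI in angle brackets followed by its attributes.
LINK_PATTERN = re.compile(r'<([^>]+)>\s*;([^<]*)')
# One attribute: name="value" (datetimes contain commas, so we never split on commas).
ATTR_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def parse_timemap(text):
    """Return (original URI, list of (memento-datetime, URI-M) sorted by time)."""
    original = None
    mementos = []
    for uri, attr_text in LINK_PATTERN.findall(text):
        attrs = dict(ATTR_PATTERN.findall(attr_text))
        rels = attrs.get("rel", "").split()
        if "original" in rels:
            original = uri
        elif "memento" in rels and "datetime" in attrs:
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return original, sorted(mementos)


original, mementos = parse_timemap(SAMPLE_TIMEMAP)
print(original)
for memento_datetime, urim in mementos:
    print(memento_datetime.isoformat(), urim)
```

Sorting by memento-datetime makes it straightforward to walk a seed's history across several archives, since a single TimeMap can mix URI-Ms from different web archives.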
44. @shawnmjones @StormyArchives
Others have tackled portions of the problem of summarizing
web archives, but only AlNoamany addressed all processes
Some have conflated our
steps of generating
metadata and visualizing it.
Many have focused, and continue to
focus, on selecting exemplar
words, sentences, images,
video clips, and more for
summarization.
Those who have evaluated
surrogates in the past focused
on whether the participant chose
the correct search engine result,
not on understanding.
Attempts to manually apply metadata
to these collections are impacted by
the scale of the problem.
45. @shawnmjones @StormyArchives
AlNoamany identified the characteristics of social
media stories and Archive-It collections
[Diagram: the five processes: Select Exemplars, Generate Story Metadata, Generate Document Metadata, Visualize The Story, Distribute The Story]
By analyzing the characteristics
of stories and collections, she
determined that popular
stories contain 28 elements.
Our model maps to hers but
expands her visualize step.
AlNoamany’s
sieve diagram
gives us one
solution for
storytelling. We will
explore others.
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From
Archived Collections,” in ACM Web Science, pp. 309–318, 2017.
https://doi.org/10.1145/3091478.3091508.
46. @shawnmjones @StormyArchives
AlNoamany extracted some story metadata and relied on
Storify to create and distribute the resulting visualization.
[Diagram: the five processes: Select Exemplars, Generate Story Metadata, Generate Document Metadata, Visualize The Story, Distribute The Story]
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From
Archived Collections,” in ACM Web Science, pp. 309–318, 2017.
https://doi.org/10.1145/3091478.3091508.
47. @shawnmjones @StormyArchives
Her proof-of-concept generated some document
metadata and relied on Storify to generate the rest.
[Diagram: the five processes (Select Exemplars, Generate Story Metadata, Generate Document Metadata, Visualize The Story, Distribute The Story) divided between AlNoamany’s Proof-of-Concept (POC) and Storify. Both POC and Storify generated portions of document metadata.]
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From
Archived Collections,” in ACM Web Science, pp. 309–318, 2017.
https://doi.org/10.1145/3091478.3091508.
48. @shawnmjones @StormyArchives
She generated many different stories based on
exemplars selected by her proof-of-concept
[Diagram: the five processes (Select Exemplars, Generate Story Metadata, Generate Document Metadata, Visualize The Story, Distribute The Story) divided between AlNoamany’s Proof-of-Concept (POC) and Storify. Both POC and Storify generated portions of document metadata.]
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From
Archived Collections,” in ACM Web Science, pp. 309–318, 2017.
https://doi.org/10.1145/3091478.3091508.
49. @shawnmjones @StormyArchives
Through a user study, she demonstrated that participants
could tell the difference between her solution’s stories and
randomly generated stories
Participants could not tell the
difference between her
solution’s stories and those
generated by human archivists
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Generating Stories From
Archived Collections,” in ACM Web Science, pp. 309–318, 2017.
https://doi.org/10.1145/3091478.3091508.
50. @shawnmjones @StormyArchives
Unfortunately, her solution is difficult to generalize
[Diagram: the five processes (Select Exemplars, Generate Story Metadata, Generate Document Metadata, Visualize The Story, Distribute The Story) divided between AlNoamany’s Proof-of-Concept (POC) and Storify. Both POC and Storify generated portions of document metadata.]
Adobe shut down
the Storify platform
in 2018.
AlNoamany’s POC
focused on Archive-It.
51. @shawnmjones @WebSciDL
Outline
1. Motivation And Research
Questions
2. Background And Related Work
3. Selecting Exemplars And
Generating Story Metadata
4. Generating Document Metadata
5. Visualizing And Distributing Stories
6. Contributions And Conclusion
52. @shawnmjones @StormyArchives
As collection users, what structural features
can we view from outside?
• Using only structural features is advantageous because it saves one from having to download a collection’s content.
• These structural features give us different insight than can be provided by text analysis or metadata.
81,014 seeds
486,227 seed mementos
Structural features shown here:
• number of seeds
• number of mementos
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Selecting Exemplars and Generating Story Metadata
53. @shawnmjones @StormyArchives
Was the collection built from web sites belonging
to one domain or many?
Many domains One domain
Structural feature
discussed here:
• domain diversity
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
54. @shawnmjones @StormyArchives
Were most of the web pages in the collection top-level
pages or specific articles deeper in a web site?
Top-level pages Deeper links
Structural feature
discussed here:
• path depth diversity
• most frequent path
depth
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
55. @shawnmjones @StormyArchives
Growth curves provide some understanding of
collection curation behavior
• Skew of the collection’s holdings
  • Indicates temporality of collection
• Skew of the curatorial involvement with the collection
  • When seeds were added
  • When interest was lost or regained
[Figure: example growth curves with positive and negative skew]
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
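One way to compute the points of a growth curve is sketched below in Python. This is an illustration under assumptions, not the implementation from the cited paper: the `captures` data is hypothetical, `growth_curve` is an illustrative name, and the sketch assumes that a seed counts as added at the datetime of its first memento.

```python
from datetime import datetime

# Hypothetical input: seed URI -> capture datetimes of its mementos.
captures = {
    "http://example.com/a": [datetime(2016, 1, 1), datetime(2016, 7, 1)],
    "http://example.com/b": [datetime(2016, 1, 1), datetime(2017, 1, 1)],
    "http://example.com/c": [datetime(2016, 7, 1)],
    "http://example.com/d": [datetime(2017, 1, 1)],
}


def growth_curve(captures):
    """Return (fraction of lifespan, fraction of seeds, fraction of mementos) points."""
    all_times = sorted(t for times in captures.values() for t in times)
    # Assumption: a seed is "added" when its first memento is captured.
    first_seen = [min(times) for times in captures.values()]
    start, end = all_times[0], all_times[-1]
    lifespan = (end - start).total_seconds() or 1.0  # avoid dividing by zero
    points = []
    for t in sorted(set(all_times)):
        life = (t - start).total_seconds() / lifespan
        seeds = sum(1 for f in first_seen if f <= t) / len(first_seen)
        mementos = sum(1 for m in all_times if m <= t) / len(all_times)
        points.append((life, seeds, mementos))
    return points


points = growth_curve(captures)
for life, seeds, mementos in points:
    print(f"{life:.0%} of lifespan: {seeds:.0%} of seeds, {mementos:.0%} of mementos")
```

In this toy collection, half of the seeds exist at the very start of the collection's life, so the seed curve sits above the diagonal; a curve below the diagonal would indicate seeds added late.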
56. @shawnmjones @StormyArchives
We discovered four semantic categories in
Archive-It collections
Self-Archiving
54.1% of collections
Subject-based
27.6% of collections
Time Bounded – Expected
14.1% of collections
Time Bounded – Spontaneous
4.2% of collections
Some evaluated by AlNoamany
In a study of 3,382 Archive-It collections
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
57. @shawnmjones @StormyArchives
Self-Archiving collections dominate Archive-It
54.1% of collections
Organizations archiving themselves or those they are responsible for.
Other categories: Subject-based 27.6%; Time Bounded – Expected 14.1%; Time Bounded – Spontaneous 4.2%
In a study of 3,382 Archive-It collections
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
58. @shawnmjones @StormyArchives
Subject-based collections come in second
27.6% of collections
Collections centered on a subject that is not ephemeral.
Other categories: Self-archiving 54.1%; Time Bounded – Expected 14.1%; Time Bounded – Spontaneous 4.2%
In a study of 3,382 Archive-It collections
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
59. @shawnmjones @StormyArchives
Time Bounded – Expected
collections summarize events
we anticipate
14.1% of collections
Collections about an anticipated event.
Other categories: Self-archiving 54.1%; Subject-based 27.6%; Time Bounded – Spontaneous 4.2%
In a study of 3,382 Archive-It collections
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
60. @shawnmjones @StormyArchives
Time Bounded – Spontaneous collections summarize
unexpected events
4.2% of collections
Collections about an unexpected event. Some of these were evaluated by AlNoamany.
Other categories: Self-archiving 54.1%; Subject-based 27.6%; Time Bounded – Expected 14.1%
In a study of 3,382 Archive-It collections
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
61. @shawnmjones @StormyArchives
We can bridge the structural to the
descriptive…
Self-Archiving
54.1% of collections
Subject-based
27.6% of collections
Time Bounded – Expected
14.1% of collections
Time Bounded – Spontaneous
4.2% of collections
Some evaluated by AlNoamany
Using the structural features mentioned previously, we can predict these
semantic categories with a Random Forest classifier with F1 = 0.720
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
62. @shawnmjones @StormyArchives
RQ1: What types of web archive collections exist
and what structural features do they have?
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Type                         % of Archive-It Collections   Description                        Example Collection
Self-Archiving               54.1%                         an organization archiving itself   University of Utah Web Archive
Subject-based                27.6%                         seeds bound by single topic        Environmental Justice
Time Bounded – Expected      14.1%                         an expected event or time period   2008 Olympics
Time Bounded – Spontaneous   4.2%                          unexpected event                   Tucson Shootings
Based on a manual review of 3,382 Archive-It
collections, we classified them into 4 types.
Growth curves give us some idea
of the curatorial involvement with
a collection over time.
When selecting exemplars, we
need to summarize the collection
in terms of time and topic. The
shapes of these growth curves
indicate how we might cluster in
time.
This example growth curve
shows us that 30% of the
seeds were added early in
the collection’s life.
Structurally, for seeds, we can study the:
• distribution of domains
• distribution of path depths
• most frequent path depth
• query string usage
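The seed features listed above can be computed from seed URIs alone, which is why no collection content needs to be downloaded. The Python sketch below is illustrative only: the seed URIs are hypothetical, `path_depth` and `seed_features` are illustrative names, the hostname stands in for the domain, and the exact definitions used in the cited paper may differ.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical seed URIs for a small collection.
seeds = [
    "http://example.com/",
    "http://example.org/",
    "http://example.com/news/2016/flood.html",
    "http://news.example.org/story?id=42",
]


def path_depth(uri):
    """Count the non-empty path segments, so a top-level page has depth 0."""
    return len([segment for segment in urlsplit(uri).path.split("/") if segment])


def seed_features(seeds):
    """Summarize a list of seed URIs using only their structure."""
    # Assumption: the hostname is used as a stand-in for the domain.
    domains = Counter(urlsplit(uri).hostname for uri in seeds)
    depths = Counter(path_depth(uri) for uri in seeds)
    with_query = sum(1 for uri in seeds if urlsplit(uri).query)
    return {
        "domain_distribution": dict(domains),
        "path_depth_distribution": dict(depths),
        "most_frequent_path_depth": depths.most_common(1)[0][0],
        "query_string_usage": with_query / len(seeds),
    }


features = seed_features(seeds)
for name, value in features.items():
    print(name, value)
```

A collection dominated by depth 0 seeds from many hostnames looks structurally different from one made of deep article links on a single site, which is the kind of signal the structural features are meant to capture.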
63. @shawnmjones @StormyArchives
Identifying off-topic mementos is key to choosing
exemplar mementos
Hacked
Moved on from topic
Collections have a topic.
Seeds are selected to
support that topic.
Mementos are
observations of seeds.
Some of these versions are
off-topic.
Excluding these off-topic
mementos from
consideration is key to
selecting exemplars.
Web Page Gone
Account Suspension
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
64. @shawnmjones @StormyArchives
We found that Word Count had the best F1
score for identifying off-topic mementos
We reused AlNoamany’s
labeled dataset.
She did not try:
• Sorensen-Dice
• Simhash of raw
content
• Simhash of TF
• Gensim LSI
Our word count
accuracy came out
ahead of AlNoamany’s.
Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within TimeMaps in Web
archives,” International Journal on Digital Libraries, 2016. https://doi.org/10.1007/s00799-016-0183-5
Y. AlNoamany and S. M. Jones, “Off-Topic Gold Standard Dataset,” GitHub. 2018.
https://github.com/oduwsdl/offtopic-goldstandard-data
65. @shawnmjones @StormyArchives
Filtering off-topic mementos is just one step in a set of
algorithmic primitives for selecting exemplars
We can filter the collection to get a
good set of exemplars and then
randomly sample from the remainder.
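A minimal sketch of the filter and sample primitives follows. It is not Hypercane's implementation; the function names, the off_topic flag, and the memento identifiers are hypothetical, and we assume off-topic detection has already labeled each memento.

```python
import random

def filter_mementos(mementos, keep):
    """Return only the mementos for which keep(memento) is true."""
    return [m for m in mementos if keep(m)]

def random_sample(mementos, k, seed=None):
    """Randomly choose up to k mementos; seed makes the choice repeatable."""
    rng = random.Random(seed)
    return rng.sample(mementos, min(k, len(mementos)))

collection = [
    {"urim": "memento-1", "off_topic": False},
    {"urim": "memento-2", "off_topic": True},
    {"urim": "memento-3", "off_topic": False},
    {"urim": "memento-4", "off_topic": False},
]

# Filter first, then randomly sample from what remains.
on_topic = filter_mementos(collection, keep=lambda m: not m["off_topic"])
exemplars = random_sample(on_topic, k=2, seed=42)
print([m["urim"] for m in exemplars])
```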
66. @shawnmjones @StormyArchives
Ordering allows us to create meaning from a
list of mementos
We can order the collection by some
feature and then systematically sample
every jth memento from the remainder.
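Ordering followed by systematic sampling can be sketched as below. The function names and the memento records are hypothetical, and we assume that "every jth memento" means positions j, 2j, 3j, and so on in the ordered list.

```python
def order_by(mementos, key, descending=False):
    """Order mementos by some feature, such as memento-datetime."""
    return sorted(mementos, key=key, reverse=descending)

def systematic_sample(ordered, j):
    """Take every jth memento from an already ordered list."""
    if j < 1:
        raise ValueError("j must be at least 1")
    return ordered[j - 1::j]

# Hypothetical mementos captured on nine different days, listed out of order.
collection = [
    {"urim": f"memento-{day}", "memento_datetime": f"2011-02-{day:02d}"}
    for day in (9, 3, 7, 1, 5, 8, 2, 6, 4)
]

ordered = order_by(collection, key=lambda m: m["memento_datetime"])
exemplars = systematic_sample(ordered, j=3)
print([m["urim"] for m in exemplars])
```

Because the list is ordered by time before sampling, the exemplars are spread evenly across the collection's life.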
67. @shawnmjones @StormyArchives
[Figure: pipeline from a web archive collection to exemplars. filter: include only mementos containing a given pattern (reduces the number of pages to consider); score: score results with BM25 (scores results); order: order by descending score (orders results)]
Scoring gives us an idea of how well a memento meets
the information needs represented by a function
We can combine filter, score, and order
to create a simple search engine.
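The filter, score, and order combination can be sketched as a tiny search engine. This is an illustration rather than Hypercane's code: it uses one common variant of the Okapi BM25 formula, the parameter values k1 = 1.5 and b = 0.75 are conventional defaults that we assume here, tokenization is a naive whitespace split, and the memento texts are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a common BM25 variant."""
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def search(query, mementos, pattern):
    # filter: include only mementos containing the pattern
    candidates = [m for m in mementos if pattern in m["text"].lower()]
    if not candidates:
        return []
    # score: BM25 score for each remaining memento
    scores = bm25_scores(query, [m["text"] for m in candidates])
    # order: descending score
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [m for _, m in ranked]

mementos = [
    {"urim": "m1", "text": "Protest in Cairo as revolution spreads"},
    {"urim": "m2", "text": "Egypt revolution: protest continues, protest grows in Egypt"},
    {"urim": "m3", "text": "Football scores from the weekend"},
]
results = search("egypt protest", mementos, pattern="protest")
print([m["urim"] for m in results])
```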
68. @shawnmjones @StormyArchives
Clustering based on a feature allows us to
imbue subsets of mementos with meaning
With these primitives, we can
reproduce AlNoamany’s Algorithm
which we will now call DSA1.
69. @shawnmjones @StormyArchives
These primitives allow us to create other algorithms for
selecting exemplars that tell the story the user desires
DSA2 focuses on representing collection
growth curves and scoring mementos
by their surrogate metadata.
DSA3 focuses on mementos that best
match the collection topic.
DSA4 focuses on finding the most novel
mementos in the collection.
70. @shawnmjones @StormyArchives
Search engines are the de facto method of exploring
collections; if we consider them a baseline, then how
retrievable are the exemplars produced by DSA algorithms?
We loaded 8
different Archive-It
collections into
different instances
of the SolrWayback
web archive
search engine.
We also executed
4 different DSA
algorithms to
produce exemplars
from these
collections.
[Figure: web archive collections and the exemplars produced from each]
71. @shawnmjones @StormyArchives
We then generated queries with four different methods based on
the content of the exemplars produced by each DSA algorithm
72. @shawnmjones @StormyArchives
We visualized the percentage of exemplars
that were never retrieved by any query
x-axis: the number of search results to review before we find the exemplar
y-axis: the percentage of exemplars that have zero retrievability
In this graph, we are reporting
zero retrievability with:
• queries from doc2query-T5
• for exemplars chosen by DSA3
At 10 search results,
57.82% of the exemplars
were not retrieved.
After 1000 search results,
36.05% of the exemplars
were not retrieved.
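The percentage on the y-axis can be sketched as follows. This is a simplified illustration, assuming that an exemplar has zero retrievability at a cutoff when no query returns it within that many results; the function name, query names, and result lists are hypothetical.

```python
def zero_retrievability(exemplars, ranked_results, cutoff):
    """Percentage of exemplars not returned by any query within the top `cutoff` results."""
    retrieved = set()
    for results in ranked_results.values():
        retrieved.update(results[:cutoff])
    missed = [e for e in exemplars if e not in retrieved]
    return 100.0 * len(missed) / len(exemplars)

exemplars = ["a", "b", "c", "d"]
# Hypothetical ranked search results for two generated queries.
ranked_results = {
    "query-1": ["x", "a", "y", "b"],
    "query-2": ["c", "z"],
}

for cutoff in (2, 4):
    print(cutoff, zero_retrievability(exemplars, ranked_results, cutoff))
```

As the cutoff grows, more exemplars are found, so the percentage can only stay the same or fall, which matches the downward trend from 10 to 1000 search results.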
73. @shawnmjones @StormyArchives
For all query methods the DSA algorithms’ exemplars
have similar retrievability
74. @shawnmjones @StormyArchives
For all query methods the DSA algorithms’ exemplars
have similar retrievability
If all pages are relevant, then DSA algorithms produce mementos with more novelty than standard query
methods can with a state-of-the-art web archive search engine.
DSA4 was designed to surface more novel mementos and meets its goal in these results.
75. @shawnmjones @StormyArchives
RQ2: Which approaches work best for selecting
exemplars from web archive collections?
We established that four different
algorithms produced from these
primitives will select exemplars
that were not retrievable using
standard query methods and a
state-of-the-art web archive
search engine.
Removing off-topic mementos is but one step
toward selecting exemplars.
We devised a set of primitives for creating
many different types of sampling algorithms
that consider structural features.
An important step in
selecting exemplars to
summarize the collection is
identifying off-topic
mementos. We found that
word count differences
work best.
76. @shawnmjones @StormyArchives
We implemented these primitives as part of Hypercane
Hypercane was used to
conduct the experiments in
this section.
S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson. 2021. Hypercane: Intelligent Sampling for Web
Archive Collections. In ACM/IEEE JCDL 2021. [to be published in September 2021]
77. @shawnmjones @WebSciDL
Outline
1. Motivation And Research
Questions
2. Background And Related Work
3. Selecting Exemplars And
Generating Story Metadata
4. Generating Document Metadata
5. Visualizing And Distributing Stories
6. Contributions And Conclusion
78. @shawnmjones @StormyArchives
We evaluated 55 platforms in 2017 and found that existing social
platforms do not reliably produce surrogates for mementos
Generating Document Metadata
If we cannot rely upon the service to generate a surrogate, we will have to create
our own.
Which surrogate works best for understanding web archive collections?
S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive
Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
79. @shawnmjones @StormyArchives
We reused exemplars that archivists had selected to
describe their own collections to create stories with
different surrogates...
80. @shawnmjones @StormyArchives
Archive-It like surrogates visualize these
mementos as they are on Archive-It
Archive-It like surrogate
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
This is a screenshot of a subset of
the exemplars selected by the
archivists of the Archive-It
collection
Egypt Politics and Revolution.
81. @shawnmjones @StormyArchives
Browser thumbnails are screenshots of the page in a
browser
Browser thumbnails
Browser thumbnails
are a popular
surrogate type used
at web archives.
This is a screenshot of the exemplars
selected by the archivists of the
Archive-It collection
Egypt Politics and Revolution.
82. @shawnmjones @StormyArchives
Social cards come from social media
platforms
Social cards
Social cards are a type
of surrogate typically
found on social media
platforms like
Facebook or Twitter.
These social cards were
specially designed to
include information
from web archives.
This is a screenshot of a subset of
the exemplars selected by the
archivists of the Archive-It
collection
Egypt Politics and Revolution.
83. @shawnmjones @StormyArchives
sc/t combines social cards and thumbnails
sc/t
We replaced the
striking image of the
social card with a
browser thumbnail.
This is a screenshot of a subset of
the exemplars selected by the
archivists of the Archive-It
collection
Egypt Politics and Revolution.
84. @shawnmjones @StormyArchives
sc+t places the social card to the left and a thumbnail
to the right
sc+t
Our thought was that
more information was
better.
This is a screenshot of a subset of
the exemplars selected by the
archivists of the Archive-It
collection
Egypt Politics and Revolution.
85. @shawnmjones @StormyArchives
sc^t is interactive
sc^t
When a user hovers
over the striking image
the browser thumbnail
appears.
This provides both types
of surrogates in a smaller
space.
This is a screenshot of a subset of
the exemplars selected by the
archivists of the Archive-It
collection
Egypt Politics and Revolution.
86. @shawnmjones @StormyArchives
We then presented these stories to Mechanical Turk (MT)
participants
The six surrogate types:
• Archive-It like
• Browser thumbnails
• Social Card
• Social Card With Thumbnail as Image (sc/t)
• Social Card With Thumbnail to Right (sc+t)
• Social Card with Thumbnail on Hover (sc^t)
• 4 stories of 15-17 URI-Ms selected by human
Archive-It curators from their collections
• 6 different surrogate types
• 24 different story-surrogate combinations
• 120 MT participants
• Given 30 seconds to view each story
87. @shawnmjones @StormyArchives
And then asked them which of the following come from
the same collection…
• Each participant was shown a list of 6 surrogates of the same type as the story they just viewed.
• They were asked to choose the 2 that they thought came from the same collection.
• They were given as much time as they wished to answer the question.
• This process is like the Sentence Verification Task used in reading comprehension studies.
88. @shawnmjones @StormyArchives
Social cards probably outperform the Archive-It
surrogate for participants’ correct answers
[Figure: Correct Answers Per Surrogate, a chart of median and mean correct answers (axis from 0 to 2.5) for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t, annotated with p = 0.0569 and p = 0.0770]
89. @shawnmjones @StormyArchives
Social cards produced less interaction while participants
viewed their stories
We measured clicks and hovers by participants while they were viewing their stories.
For browser thumbnails alone, most of the participants clicked the link to view the actual
memento behind the surrogate.
90. @shawnmjones @StormyArchives
RQ3: What surrogates work best for
understanding groups of mementos?
From a user study with 120 Mechanical Turk participants:
• 4 stories of 15-17 mementos selected by human curators from their own collections
• 6 different surrogate types
• 24 different story-surrogate combinations
• Each given 30 seconds to view a story, then asked a question
Correct answers per surrogate indicate that social cards probably outperform the Archive-It surrogate.
With social cards, users were able to correctly answer our questions without as much interaction.
91. @shawnmjones @StormyArchives
Social cards are generated based on the
HTML metadata that authors provide
Title: og:title -or- twitter:title -or- <title>
Description: og:description -or- twitter:description -or- description
Image: og:image -or- twitter:image
Without twitter:card and og:title or twitter:title, Twitter gives up and does not generate a card.
Facebook parses the <title> and produces a card with just a title.
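The fallback order among these fields can be sketched with Python's standard HTML parser. This is an illustration of the -or- chains only, not how any platform or MementoEmbed implements card generation; the class name, function names, and sample page are hypothetical.

```python
from html.parser import HTMLParser

class CardMetadataParser(HTMLParser):
    """Collect <meta> property/name values and the <title> text."""
    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                self.meta.setdefault(key.lower(), attrs["content"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.title = (self.title or "") + data.strip()

def first_available(meta, keys, default=None):
    """Return the value of the first key present, following the -or- order."""
    for key in keys:
        if key in meta:
            return meta[key]
    return default

def card_fields(html):
    parser = CardMetadataParser()
    parser.feed(html)
    meta = parser.meta
    return {
        "title": first_available(meta, ["og:title", "twitter:title"], parser.title),
        "description": first_available(
            meta, ["og:description", "twitter:description", "description"]),
        "image": first_available(meta, ["og:image", "twitter:image"]),
    }

page = """<html><head><title>Plain title</title>
<meta name="twitter:title" content="Twitter title">
<meta name="description" content="A plain description">
</head><body></body></html>"""
fields = card_fields(page)
print(fields)
```

A None value for a field is the case the rest of this section addresses: the metadata does not exist and must be generated.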
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. 2021. Automatically Selecting Striking Images
for Social Cards. In ACM WebSci ‘21. https://arxiv.org/pdf/2103.04899.
What do we do if this
metadata does not exist?
92. @shawnmjones @StormyArchives
We analyzed 277,724 news articles captured by the Internet
Archive from 1998 to 2016, and found different rates
of metadata adoption
OGP = Open Graph Protocol
Facebook Cards
150 billion
documents in the
Internet Archive
were captured
before 2010 and
thus have no card
metadata
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “Automatically Selecting Striking Images for Social
Cards,” In ACM Web Science Conference, 2021. https://doi.org/10.1145/3447535.3462505.
S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “It's All About The Cards: Sharing
on Social Media Probably Encouraged HTML Metadata Growth,” To be in ACM/IEEE Joint Conference
on Digital Libraries, 2021. [preprint: https://arxiv.org/abs/2104.04116.]
93. @shawnmjones @StormyArchives
By applying author behavior, we can
generate descriptions
We used the existing field values, written by page authors, as
ground truth data.
It tells us that authors tend to write card descriptions that have
the following lengths:
• 268 characters
• 52 words
• 2 sentences
We can use this length as input to automatic text summarization
algorithms.
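As a sketch of how these lengths could constrain a summarizer, the example below keeps at most 2 sentences and 268 characters. The sentence scoring is a simple word-frequency heuristic that we substitute here for a real summarization algorithm, and the function name and sample text are hypothetical.

```python
import re
from collections import Counter

MAX_SENTENCES = 2   # typical author-written description length in sentences
MAX_CHARS = 268     # typical author-written description length in characters

def summarize(text, max_sentences=MAX_SENTENCES, max_chars=MAX_CHARS):
    """Pick the highest-scoring sentences, keep their original order, then cap the length."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    top.sort(key=sentences.index)  # restore document order
    # Note: the character cap may cut the last sentence short.
    return " ".join(top)[:max_chars].rstrip()

text = ("Archivists captured the protest pages in 2011. "
        "The protest pages changed often. "
        "Weather was mild. "
        "Many protest pages later vanished from the live web.")
summary = summarize(text)
print(summary)
```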
94. @shawnmjones @StormyArchives
If no metadata
exists, we can
select a striking
image from the
images
available in the
document
Which of the images
outlined in red is the
striking one chosen by the
author?
How would a machine
know which one to choose
if there were no striking
image specified in the
metadata?
95. @shawnmjones @StormyArchives
Our generic image selection
approach has 3 steps
1. Score each image in the
document by some
approach (e.g., ML
probability, feature
value)
2. Sort the list of images by
descending score (e.g.,
highest ML probability is
first, image with most
colors is first)
3. Choose the image at the
beginning of the list
(highest scoring)
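The three steps above can be sketched generically. The function name, image names, color counts, and probabilities below are hypothetical values chosen for illustration, not results from our experiments.

```python
def choose_striking_image(images, score):
    """Score each image, sort by descending score, and choose the first."""
    scored = [(score(image), image) for image in images]                     # step 1
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)          # step 2
    return ranked[0][1] if ranked else None                                  # step 3

# Hypothetical candidate images from one document.
images = [
    {"uri": "a.png", "color_count": 1200, "ml_probability": 0.10},
    {"uri": "b.jpg", "color_count": 54000, "ml_probability": 0.25},
    {"uri": "c.png", "color_count": 300, "ml_probability": 0.65},
]

by_colors = choose_striking_image(images, score=lambda i: i["color_count"])
by_probability = choose_striking_image(images, score=lambda i: i["ml_probability"])
print(by_colors["uri"], by_probability["uri"])
```

Keeping the scoring function as a parameter means a single feature value and a classifier probability can be compared under the same selection procedure.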
[Figure: candidate images sorted by color count (154,131; 48,020; 44,737; 30,940; and 3,816 colors) and sorted by classifier probability (0.3623, 0.1948, 0.1259, 0.1116, and 0.11); some images are shown resized, cropped, or larger]
96. @shawnmjones @StormyArchives
We visualized how well different approaches performed at
choosing a striking image that was perceptually the same as
the author’s
The best
approach
starts here
As we proceed to the right, we
accept more images as
perceptually equal to the one
selected by the approach
All lines converge
as any image
becomes
acceptable as
correct
Higher scores
indicate more
accurate
answers
Remember:
we are trying to find the
approach that best selects
the striking image chosen
by the author
97. @shawnmjones @StormyArchives
We found that Random Forest performed best with base image
features quickly calculated via standard image libraries
P@1=0.831
MRR=0.883
base image features:
• byte size
• width in pixels
• height in pixels
• negative space
(number of histogram columns equal to 0)
• size in pixels
• aspect ratio
• number of colors
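These base features are cheap to compute. The sketch below works on an in-memory grid of RGB pixels so that it stays self-contained; in practice an image library would supply the pixels. The function name is hypothetical, and the use of a 256-column grayscale histogram for negative space is our assumption for illustration.

```python
def base_image_features(pixels, encoded_bytes):
    """Compute base features from rows of (r, g, b) pixels and the encoded file bytes."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    flat = [pixel for row in pixels for pixel in row]

    # Assumption: a 256-column histogram of the averaged RGB value per pixel.
    histogram = [0] * 256
    for r, g, b in flat:
        histogram[(r + g + b) // 3] += 1

    return {
        "byte_size": len(encoded_bytes),
        "width": width,
        "height": height,
        "size_in_pixels": width * height,
        "aspect_ratio": width / height if height else 0.0,
        "number_of_colors": len(set(flat)),
        "negative_space": sum(1 for count in histogram if count == 0),
    }

# A hypothetical 4x2 image with four distinct colors.
pixels = [
    [(0, 0, 0), (0, 0, 0), (255, 255, 255), (255, 0, 0)],
    [(0, 0, 0), (255, 255, 255), (255, 255, 255), (0, 0, 255)],
]
features = base_image_features(pixels, encoded_bytes=b"\x00" * 120)
print(features)
```

A feature dictionary like this could then be passed to a classifier such as Random Forest, or a single value from it could be used directly as the score when ranking images.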
98. @shawnmjones @StormyArchives
RQ4: What methods that automate the creation of surrogates
produce results that best match humans' behavior?
Authors write card descriptions that
are 268 characters, 52 words, or 2
sentences long. We can use this
length as input to automatic text
summarization algorithms, like
TextRank.
With base image features Random Forest performed best
for choosing the same striking image as the author.
We analyzed the metadata
usage of news article mementos
over time.
Metadata fields associated with
cards had astronomical growth.
99. @shawnmjones @StormyArchives
We implemented these results as part of
MementoEmbed
Surrogate types: Cards, Browser Thumbnails, Imagereels, Word Clouds
As an archive-aware surrogate service, MementoEmbed provides different types of surrogates for mementos.
It also has an extensive API for generating document metadata.
100. @shawnmjones @WebSciDL
Outline
1. Motivation And Research
Questions
2. Background And Related Work
3. Selecting Exemplars And
Generating Story Metadata
4. Generating Document Metadata
5. Visualizing And Distributing Stories
6. Contributions And Conclusion
101. @shawnmjones @StormyArchives
Because Storify was gone, we created
Raintale for visualizing and distributing stories
Visualizing And Distributing Stories
S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive
Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
Storify provided an API, allowing us to configure
the look and feel of our story.
With this functionality gone, we created
Raintale, a platform agnostic storytelling tool
that generates files or social media posts.
102. @shawnmjones @WebSciDL
Remember, Elbert wants to promote his collections for
others, and he uses the DSA Toolkit to do so
Today he is
promoting a
collection about
COVID-19.
From this: 23,376 mementos
To this: a sample of 36 mementos visualized as social cards, phrases, and images
103. @shawnmjones @StormyArchives
Elbert applies all processes of our storytelling model
[Figure: the processes of the storytelling model: Generate Story Metadata, Select Exemplars, Generate Document Metadata, Visualize The Story, Distribute The Story]
104. @shawnmjones @WebSciDL
Remember, Natasha needs to compare
collections to each other
Today she is
reviewing
different
collections
about
shootings.
Virginia Tech El Paso Norway
105. @shawnmjones @StormyArchives
Ling inherited a collection and needs to
know what it contains
Ling can apply our
processes with a
different template
to include other
information, like
structural features.
From this: 88,755 mementos and no metadata
To this: 50 exemplars, structural features, metadata analysis, growth curves, and more
106. @shawnmjones @WebSciDL
Rustam wants to see how a page changed
over time
[Figure: the processes of the storytelling model: Generate Story Metadata, Select Exemplars, Generate Document Metadata, Visualize The Story, Distribute The Story]
Rustam uses
Hypercane to help
him choose a page
and then view its
change over time.
107. @shawnmjones @StormyArchives
Rustam chooses one of Raintale’s default templates
because he is using the DSA Toolkit for exploration
Rustam’s story
seems plain, but he
is really interested in
the changing text
over time.
108. @shawnmjones @WebSciDL
Olayinka wants to see what different news sources said
on the same day in different years
With our SHARI
process, she can
compare different
years to each other
2018: US Elections
2019: Mass shootings in El Paso and Dayton
2020: COVID-19
S. M. Jones, A. C. Nwala, M. C. Weigle, M. Klein, and M. L. Nelson, “SHARI -- An Integration of Tools to
Visualize the Story of the Day,” In Web Archiving and Digital Libraries Workshop, 2020.
https://arxiv.org/abs/2008.00139
109. @shawnmjones @StormyArchives
Olayinka can look through the stories produced by our
SHARI process to perform her comparisons
Our process is not
just limited to our
implementation,
and allows us to
incorporate input
from other systems,
like StoryGraph.
[Figure: the processes of the storytelling model: Generate Document Metadata, Visualize The Story, Distribute The Story, Select Exemplars, Generate Story Metadata]
110. @shawnmjones @WebSciDL
Outline
1. Motivation And Research
Questions
2. Background And Related Work
3. Selecting Exemplars And
Generating Story Metadata
4. Generating Document Metadata
5. Visualizing And Distributing Stories
6. Contributions And Conclusion
112. @shawnmjones @WebSciDL
We established a vocabulary for different
types and structural features of collections
Type | % of Archive-It Collections | Description | Example Collection
Self-Archiving | 54.1% | an organization archiving itself | University of Utah Web Archive
Subject-based | 27.6% | seeds bound by single topic | Environmental Justice
Time Bounded – Expected | 14.1% | an expected event or time period | 2008 Olympics
Time Bounded – Spontaneous | 4.2% | unexpected event | Tucson Shootings
Based on a manual review of 3,382 Archive-It
collections, we classified them into 4 types.
Growth curves give us
some idea of the curatorial
involvement with a
collection over time.
Structurally, for seeds, we can study the:
• distribution of domains
• distribution of path depths
• most frequent path depth
• query string usage
iPres 2018
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Contributions
113. @shawnmjones @StormyArchives
Word count is a fast, effective intra-TimeMap
method of identifying off-topic mementos
iPres 2018
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
Technical problems
Page gone
Hacking
Moving on from topic
114. @shawnmjones @WebSciDL
We devised a set of primitives for intelligently selecting
exemplars from web archive collections
115. @shawnmjones @StormyArchives
Hypercane implements our primitives for
selecting exemplars
ACM/IEEE
JCDL 2021
ACM SIGWEB
Newsletter 2021
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. Hypercane: Intelligent Sampling for Web Archive
Collections. In ACM/IEEE Joint Conference on Digital Libraries, 2021. [To be published in 2021]
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson. Hypercane: Toolkit for Summarizing Large
Collections of Archived Pages. In SIGWEB Newsletter Autumn, 2021. [To be published in 2021]
116. @shawnmjones @WebSciDL
We created four different algorithms from these
primitives and found that they produce exemplars with
low retrievability with a state-of-the-art search engine
We applied four different
query methods to the
mementos surfaced by
these algorithms.
As designed, our DSA4
algorithm surfaced more
novel exemplars than
those discoverable via
the search engine.
We measured mean
retrievability and zero
retrievability to determine
how easy a document
was to retrieve with the
given query method.
117. @shawnmjones @StormyArchives
Our user study provides engineers support for
choosing social cards over other surrogate types
From our user study, correct answers per
surrogate indicate that social cards
probably outperform the Archive-It
surrogate
With social cards, users were able to
correctly answer our questions without
as much interaction.
ACM CIKM 2019
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM International Conference on Information and Knowledge
Management, 2019. https://doi.org/10.1145/3357384.3358039.
118. @shawnmjones @WebSciDL
We established methods for generating the metadata
for social cards if it does not exist
S. M. Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “Automatically Selecting Striking Images for Social
Cards,” In ACM Web Science Conference, 2021. https://doi.org/10.1145/3447535.3462505.
ACM Web Science
2021
For choosing striking
images, we trained
classifiers using base image
features (e.g., pixel size,
color count) to choose the
same striking image that
web page authors chose.
Random Forest with these
base image features
performed best.
119. @shawnmjones @StormyArchives
We explored the reasons for metadata adoption
S. M. Jones, V. Neblitt-Jones, M. C. Weigle, M. Klein, and M. L. Nelson, “It's All About The Cards: Sharing
on Social Media Probably Encouraged HTML Metadata Growth,” To be in ACM/IEEE Joint Conference
on Digital Libraries, 2021. [preprint: https://arxiv.org/abs/2104.04116.]
ACM/IEEE
JCDL 2021
Many efforts have been
made to encourage
metadata adoption by
web page authors.
Once social card
metadata became
available, its use
skyrocketed!
120. @shawnmjones @WebSciDL
We released MementoEmbed and Raintale as reference
implementations for visualizing and distributing stories
WADL 2020 WADL 2020
We detailed how to generate
document metadata with
MementoEmbed and visualize and
distribute the story with Raintale.
We also provided an example of
these processes for a day’s news.
S. M. Jones, M. Klein, M. C. Weigle, and M. L. Nelson, “MementoEmbed and Raintale for Web Archive
Storytelling,” In Web Archiving and Digital Libraries Workshop, 2020. https://arxiv.org/abs/2008.00137
S. M. Jones, A. C. Nwala, M. C. Weigle, M. Klein, and M. L. Nelson, “SHARI -- An Integration of Tools to
Visualize the Story of the Day,” In Web Archiving and Digital Libraries Workshop, 2020.
https://arxiv.org/abs/2008.00139
121. @shawnmjones @WebSciDL
And I am eager to apply this
expertise at
Los Alamos National Laboratory’s
Information Sciences Division
(CCS-3)
https://oduwsdl.github.io/dsa-puddles/shawnmjones/
122. @shawnmjones @StormyArchives
Using our model and the lessons from these research
questions, we have implemented tools to tell stories that
summarize web archive collections
[Figure: the processes of the storytelling model: Generate Story Metadata, Select Exemplars, Generate Document Metadata, Visualize The Story, Distribute The Story]
Read the dissertation for
• use cases
• more example stories
• details on experiments
• details on these tools
• examples with web
archives other than
Archive-It
A sample of future work ideas:
• better summary evaluation
• augmenting collections with live web metadata
• entity/topic cards rather than social cards
• summarizing scholar output, project status, scatter/gather
interfaces
• solving corporate intranet search problems
Contributions:
• 5-process model for automatic storytelling
• vocabulary for types of web archive collections
• structural features of web archive collections
• word count works best for identifying off-topic mementos
• set of primitives for building algorithms
• algorithms built with primitives select novel exemplars that a standard search engine did not discover
• social cards provide better understanding than the existing state-of-the-art web archive surrogates
• machine learning can select the same striking images as a page author
• Hypercane, MementoEmbed, and Raintale as implementations
Conclusion
https://oduwsdl.github.io/dsa/
123. @shawnmjones @StormyArchives
What story will you tell with web archives?
125. @shawnmjones @StormyArchives
As collection users, we view Archive-It collections
from outside…
• Curators select seeds, which are captured as seed mementos
• Deep mementos are created from other pages linked to seeds
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
Selecting Exemplars and Generating Story Metadata
126. @shawnmjones @StormyArchives
Mean response times differed across surrogate types, but the differences were not statistically significant at p < 0.05
[Figure: Response Times Per Surrogate. Median and mean response times (axis 0 to 160) for Archive-It Facsimile, Browser Thumbnails, Social Cards, sc+t, sc/t, and sc^t. p-values shown: p = 0.190 and p = 0.202]
S. M. Jones, M. C. Weigle, and M. L. Nelson, “Social Cards Probably Provide For Better Understanding Of
Web Archive Collections,” In ACM CIKM, 2019. https://doi.org/10.1145/3357384.3358039.
127. @shawnmjones @StormyArchives
The Off-Topic Memento Toolkit (OTMT) compares a seed’s first memento with the seed’s other mementos via different measures…
Measure | Fully Equivalent Score | Fully Dissimilar Score | Preprocessing Performed | OTMT -tm keyword
Byte Count | 0.0 | -1.0 | No | bytecount
Word Count | 0.0 | -1.0 | Yes | wordcount
Jaccard Distance | 0.0 | 1.0 | Yes | jaccard
Sørensen-Dice | 0.0 | 1.0 | Yes | sorensen
Simhash of Term Frequencies | 0 | 64 | Yes | simhash-tf
Simhash of raw memento | 0 | 64 | No | simhash-raw
Cosine Similarity of TF-IDF Vectors | 1.0 | 0 | Yes | cosine
Cosine Similarity of LSI Vectors | 1.0 | 0 | Yes | gensim_lsi
S. M. Jones, M. C. Weigle, and M. L. Nelson. 2018. The Off-Topic Memento Toolkit. In International
Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/UBW87
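A few of the measures listed on this slide can be sketched in plain Python. This is a minimal illustration and not the OTMT's own code: the tokenizer, the function names, and the word-count formula (relative change from the first memento) are assumptions chosen so that the outputs fall in the score ranges given for the Word Count, Jaccard Distance, and Sørensen-Dice measures.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a stand-in for real preprocessing)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def word_count_score(first, later):
    """Relative change in word count: 0.0 when equal, -1.0 when the later text is empty."""
    n_first = len(tokenize(first))
    n_later = len(tokenize(later))
    if n_first == 0:
        return 0.0  # assumption: with an empty first memento there is nothing to compare
    return (n_later - n_first) / n_first


def jaccard_distance(first, later):
    """1 - |A ∩ B| / |A ∪ B| over the sets of terms: 0.0 is identical, 1.0 is disjoint."""
    a, b = set(tokenize(first)), set(tokenize(later))
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def sorensen_dice_distance(first, later):
    """1 - 2|A ∩ B| / (|A| + |B|) over the sets of terms: 0.0 is identical, 1.0 is disjoint."""
    a, b = set(tokenize(first)), set(tokenize(later))
    if not a and not b:
        return 0.0
    return 1.0 - 2 * len(a & b) / (len(a) + len(b))
```

In this sketch, a later memento that shares no terms with the first one (for example, a "domain for sale" page replacing a news page) scores 1.0 on both set-based distances. Choosing the threshold at which a memento counts as off-topic is a separate decision that the sketch does not address.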
128. @shawnmjones @StormyArchives
Does most of the collection exist earlier or later in its life?
This collection was created in March 2010.
Most of its mementos come from 2016 – 2018.
Most of this collection exists later in its life.
Structural features discussed here:
• area under the seed memento growth curve
• lifespan of the collection
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
129. @shawnmjones @StormyArchives
When did the curator select and archive a collection’s contents?
This collection was created in March 2006.
Some of the seeds were selected in 2006.
Many of the seeds were selected throughout its life.
It has mementos as recent as July 2018.
Structural features discussed here:
• area under the seed growth curve
• lifespan of the collection
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
130. @shawnmjones @StormyArchives
Did the curator create a collection intended to archive new versions of the same web pages repeatedly?
This collection was created in June 2014.
The seeds were selected toward the beginning of its life.
Mementos were captured throughout its life.
Structural features discussed here:
• area under the seed growth curve
• area under the seed memento growth curve
• lifespan of the collection
S. M. Jones, A. Nwala, M. C. Weigle, and M. L. Nelson. 2018. The Many Shapes of Archive-It. In
International Conference on Digital Preservation (iPRES) 2018. https://doi.org/10.17605/OSF.IO/EV42P
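One way to reduce a growth curve to a single number is to normalize both axes and integrate. The sketch below is an illustration under stated assumptions, not the paper's code: time is scaled to the collection's lifespan (0 to 1), the y-axis is the fraction of items that exist by that time, and the function name is hypothetical. Passing the datetimes at which seeds first appear would approximate the seed growth curve, and passing every seed memento's datetime would approximate the seed memento growth curve.

```python
from datetime import datetime


def growth_curve_area(datetimes, start, end):
    """Area under the normalized cumulative growth curve, treated as a step function."""
    lifespan = (end - start).total_seconds()
    if lifespan <= 0 or not datetimes:
        raise ValueError("need a positive lifespan and at least one datetime")
    # Fraction of the lifespan at which each item appears, clamped to [0, 1].
    xs = [min(max((d - start).total_seconds() / lifespan, 0.0), 1.0) for d in datetimes]
    # Each item adds 1/n to the curve from its arrival until the end of the lifespan.
    return sum(1.0 - x for x in xs) / len(xs)
```

Under this normalization, an area near 1 means most items appeared early in the collection's life, and an area near 0 means most appeared late.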
131. @shawnmjones @StormyArchives
The Memento Protocol provides us a standard method for acquiring information from web archives
Background and Related Work
Memento gives us TimeGates (identified by URI-G) for finding a specific memento based on its original resource and capture datetime, its memento-datetime.
Memento also gives us TimeMaps (identified by URI-T) for listing all of the mementos for an original resource and their memento-datetimes.
Example TimeMap:
<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>; rel="self";
  type="application/link-format"
  ; from="Tue, 20 Jun 2000 18:02:59 GMT"
  ; until="Wed, 21 Jun 2000 04:41:56 GMT",
<http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>;
  rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20091027204954/http://a.example.org>;
  rel="last memento";datetime="Tue, 27 Oct 2009 20:49:54 GMT",
<http://arxiv.example.net/web/20000621011731/http://a.example.org>;
  rel="memento";datetime="Wed, 21 Jun 2000 01:17:31 GMT",
<http://arxiv.example.net/web/20000621044156/http://a.example.org>;
  rel="memento";datetime="Wed, 21 Jun 2000 04:41:56 GMT"
...
Labels in the example: URI-R (the original resource), URI-T (the TimeMap), URI-G (the TimeGate), URI-M (a memento), and memento-datetime (a memento's capture datetime)
Van de Sompel, H., Nelson, M., & Sanderson, R. “RFC 7089 – HTTP Framework for Time-Based Access to Resource States -- Memento”. http://www.rfc-editor.org/info/rfc7089. 2013.
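A TimeMap in link format is regular enough to read with a small parser. The following is a minimal sketch that assumes every attribute value is double-quoted, as in the slide's example; the sample text is abbreviated from that example, and the function names are hypothetical. A production client would more likely rely on an existing Memento library.

```python
import re
from email.utils import parsedate_to_datetime

# Sample TimeMap in application/link-format, abbreviated from the slide's example.
TIMEMAP = '''<http://a.example.org>;rel="original",
<http://arxiv.example.net/timemap/http://a.example.org>; rel="self";
 type="application/link-format",
<http://arxiv.example.net/timegate/http://a.example.org>; rel="timegate",
<http://arxiv.example.net/web/20000620180259/http://a.example.org>;
 rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",
<http://arxiv.example.net/web/20091027204954/http://a.example.org>;
 rel="last memento";datetime="Tue, 27 Oct 2009 20:49:54 GMT"'''

# One link: a URI in angle brackets followed by zero or more ;name="value" attributes.
LINK = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_timemap(text):
    """Return a list of (uri, attributes) pairs, one per link in the TimeMap."""
    return [(uri, dict(PARAM.findall(params))) for uri, params in LINK.findall(text)]


def mementos(text):
    """Return (URI-M, memento-datetime) pairs for links whose rel includes 'memento'."""
    found = []
    for uri, params in parse_timemap(text):
        if "memento" in params.get("rel", "").split() and "datetime" in params:
            found.append((uri, parsedate_to_datetime(params["datetime"])))
    return found


for uri_m, memento_datetime in mementos(TIMEMAP):
    print(memento_datetime.isoformat(), uri_m)
```

The `rel` value is split on whitespace because a single link can carry several relation types, such as "first memento".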
132. @shawnmjones @StormyArchives
We use surrogates all of the time!
Browser Thumbnail (example from UK Web Archive)
Text snippet (example from Bing)
Social Card (example from Facebook)
Text + Thumbnail (example from Internet Archive)
S. M. Jones. “Let's Get Visual and Examine Web Page Surrogates.” https://ws-dl.blogspot.com/2018/04/2018-04-24-lets-get-visual-and-examine.html, 2018.
Motivation and Research Questions
133. @shawnmjones @WebSciDL
Surrogates are not new!
Traditional surrogates contain metadata generated by humans to convey aboutness
An individual surrogate summarizes an item.
Card catalogs, however, were not stories, just manual methods for finding individual items in collections.
Motivation and Research Questions
134. @shawnmjones @StormyArchives
Surrogates provide a visual summary of the content behind a URI…
Long URI:
https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582
The same URI as a browser thumbnail surrogate:
The same URI as a social card surrogate:
Background and Related Work
135. @shawnmjones @WebSciDL
Social media storytelling uses surrogates to provide a “summary of summaries”
2 resources are shown from this Wakelet story
6 resources are shown from this Storify story
Each surrogate summarizes a web resource.
Each story groups the surrogates, summarizing the topic.
We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm.
136. @shawnmjones @StormyArchives
The Problem: Understanding web archive collections is costly
§ There are multiple collections about the “same concept.”
§ The metadata for each collection is nonexistent or inconsistently applied.
§ A seed is a web page to be crawled.
§ A memento is an observation of a seed at a specific point in time.
§ Many collections have 1000s of seeds with multiple mementos.
§ There are more than 14,000 collections.
§ Archive-It is a popular platform, but other web archive collection platforms exist (e.g., Library of Congress, Conifer, Trove).
§ Existing solutions do not handle the time dimension inherent to web archive collections.
more seeds = less metadata
137. @shawnmjones @StormyArchives
Our Solution: Social media storytelling uses groups of surrogates to provide a “summary of summaries”
Each surrogate summarizes a web resource.
Each story groups the surrogates, summarizing the topic.
We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm.
We established a five-process model for storytelling with web archive collections
A surrogate summarizes a web page. This surrogate type is called a social card.
Storytelling is the visualization. Our contribution is the automation that selects the exemplars and metadata that make this story.
138. @shawnmjones @StormyArchives
The problem, summarized
§ There are multiple collections about the same concept.
§ The metadata for each collection is nonexistent or inconsistently applied.
§ Many collections have 1000s of seeds with multiple mementos.
§ There are more than 14,000 collections.
§ Human review of these mementos for collection understanding is an expensive proposition.
139. @shawnmjones @StormyArchives
Archive-It allows easy collection creation
Archive-It was created by the Internet Archive as a consistent user interface for constructing web archive collections. Curators can supply live web resources as seeds and establish crawling schedules of those seeds to create mementos.
140. @shawnmjones @StormyArchives
Reviewing mementos manually is costly
This collection has 132,599 seeds, many with multiple mementos
Some collections have 1000s of seeds
Each seed can have many mementos
In some cases, this can require reviewing 100,000+ documents to understand the collection
141. @shawnmjones @StormyArchives
More Archive-It collections are added every year
More than 14,000 collections exist as of the end of 2020
[Figure: # of New Archive-It Collections Per Year. x-axis: Year (2005 to 2020); y-axis: # of Collections (0 to 2500); series: All Collections, Only Private Collections, Only Public Collections]
142. @shawnmjones @StormyArchives
Latent Semantic Analysis for document clustering
LSA utilizes a term-document matrix
• rows correspond to terms and columns correspond to documents
• elements are typically weighted via TF-IDF
• with TF-IDF, each element is proportional to the number of times the term appears in the document
• use singular value decomposition to factor this matrix into new matrices
• the last of these matrices contains the documents, with coordinates for each cluster
LSA requires that the user supply the desired number of topics. It will be difficult to generalize this number across types of collections.
In the figure, dark cells indicate high weights. High weights signify clustering.
Wikipedia contributors. (2019, July 26). Latent semantic analysis. In Wikipedia, The Free Encyclopedia. Retrieved 21:31, July 31, 2019, from https://en.wikipedia.org/w/index.php?title=Latent_semantic_analysis&oldid=907976703
143. @shawnmjones @StormyArchives
Latent Dirichlet Allocation
For a corpus $D$ consisting of $M$ documents, each of length $N_i$:
1. Choose $\theta_i \sim \operatorname{Dir}(\alpha)$, where $i \in \{1, \dots, M\}$ and $\operatorname{Dir}(\alpha)$ is a Dirichlet distribution with symmetric parameter $\alpha$, which typically is sparse ($\alpha < 1$)
2. Choose $\varphi_k \sim \operatorname{Dir}(\beta)$, where $k \in \{1, \dots, K\}$ and $\beta$ typically is sparse
3. For each of the word positions $i, j$, where $i \in \{1, \dots, M\}$ and $j \in \{1, \dots, N_i\}$:
   1. Choose a topic $z_{i,j} \sim \operatorname{Multinomial}(\theta_i)$
   2. Choose a word $w_{i,j} \sim \operatorname{Multinomial}(\varphi_{z_{i,j}})$ *
$K$ is the number of topics requested by the user
It will be difficult to generalize this number across types of collections.
$M$ is the number of documents in the corpus
$N_i$ is the number of words in document $i$
$\varphi_k$ is the word distribution for topic $k$
$\theta_i$ is the topic distribution for document $i$
$z_{i,j}$ is the topic for the $j$-th word in document $i$
$w_{i,j}$ is a specific word in document $i$
* e.g. of multinomial: the probability of counts of each side for rolling a k-sided die n times
Wikipedia contributors. (2019, July 25). Latent Dirichlet allocation. In Wikipedia, The Free Encyclopedia. Retrieved 20:13, July 31, 2019, from https://en.wikipedia.org/w/index.php?title=Latent_Dirichlet_allocation&oldid=907806560
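The generative process on this slide can be simulated with the standard library alone. This is a sketch under assumptions, not code from the dissertation: it shows only the forward (generative) direction rather than topic inference, the function names and the toy vocabulary in the usage note are hypothetical, and the Dirichlet draws use the standard construction of normalizing independent Gamma samples.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocabulary, doc_lengths, num_topics, alpha=0.5, beta=0.5, seed=0):
    """Generate one list of words per document by following the LDA generative steps."""
    rng = random.Random(seed)
    # One word distribution per topic (the slide's step 2).
    phi = [sample_dirichlet(beta, len(vocabulary), rng) for _ in range(num_topics)]
    corpus = []
    for length in doc_lengths:
        # One topic distribution per document (the slide's step 1).
        theta = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(length):
            # For each word position, choose a topic and then a word from that topic (step 3).
            z = rng.choices(range(num_topics), weights=theta)[0]
            words.append(rng.choices(vocabulary, weights=phi[z])[0])
        corpus.append(words)
    return corpus
```

For example, `generate_corpus(["archive", "memento", "seed", "story"], [5, 8], num_topics=2)` returns two documents of 5 and 8 words. The caller must pick `num_topics` up front, which is the number the slide notes is hard to generalize across types of collections.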
144. @shawnmjones @WebSciDL
Many have tackled selecting exemplar sentences or images from a document; few have covered selecting exemplar documents from a corpus over time.
Background and Related Work
We are inspired by these solutions and will apply some of their ideas in a moment.
Silva et al.: word graphs
Silva and Sampaio. 2014. Using Luhn’s Automatic Abstract Method to Create Graphs of Words for Document Visualization. Social Networking. 65-70. https://doi.org/10.4236/sn.2014.32008.
Sipos et al.: influential author clusters
R. Sipos et al. 2012. Temporal corpus summarization using submodular word coverage. In ACM CIKM 2012, 754-763. https://doi.org/10.1145/2396761.2396857.
145. @shawnmjones @StormyArchives
Existing tools for web archive collections require that the user have access to WARCs.
Tools shown: ArchiveSpark, Archives Unleashed Cloud (now part of Archive-It), Warclight
Archivists are the only ones likely to have that access. We want anyone to be able to summarize a collection.
Stories also need URIs for linking surrogates. WARCs alone cannot do this.
Background and Related Work
Holzmann et al. 2016. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. In ACM/IEEE JCDL 2016, 83-92. https://doi.org/10.1145/2910896.2910902.
Ruest et al. 2014. archivesunleashed/warclight – A Rails engine supporting the discovery of web archives. https://github.com/archivesunleashed/warclight.
Deschamps et al. 2019. The Cost of a WARC: Analyzing Web Archives in the Cloud. In ACM/IEEE JCDL 2019, 261-264. https://doi.org/10.1109/JCDL.2019.00043.
146. @shawnmjones @StormyArchives
Existing work on generating story metadata relies on archivists to manually review and annotate each seed or memento
Scale is the greatest challenge here. Web archive collections grow quickly, and archivists have a hard time keeping up with the number of documents to annotate.
Background and Related Work
D. V. Pitti, “Encoded Archival Description,” D-Lib Magazine, vol. 5, no. 11, 1999. https://doi.org/10.1045/november99-pitti.
Encoded Archival Description could work, if there were not thousands of documents to annotate.
147. @shawnmjones @StormyArchives
Other studies on surrogates did not focus on whether participants understood the underlying collection, but on whether participants chose the correct search result for a query
These studies did not compare thumbnails to social cards directly.
Web archives love using thumbnails, but is there something better for visitors?
Background and Related Work
148. @shawnmjones @StormyArchives
Others tried to visualize whole collections at once or created solutions specific to a web archive
Conta Me Histórias
Padia et al.
R. Campos et al. 2021. Automatic generation of timelines for past-web events. The Past Web: Exploring Web Archives, 225-242. https://doi.org/10.1007/978-3-030-63291-5_18.
K. Padia, Y. AlNoamany, and M. C. Weigle, “Visualizing digital collections at Archive-It,” in Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries, (Washington, DC, USA), pp. 15–18, 2012. https://doi.org/10.1145/2232817.2232821.
Background and Related Work
149. @shawnmjones @StormyArchives
Web surrogates provide a visual summary of the content behind a URI…
Long URI:
https://www.google.com/maps/dir/Old+Dominion+University,+Norfolk,+VA/Los+Alamos+National+Laboratory,+New+Mexico/@35.3644614,-109.356967,4z/data=!3m1!4b1!4m13!4m12!1m5!1m1!1s0x89ba99ad24ba3945:0xcd2bdc432c4e4bac!2m2!1d-76.3067676!2d36.8855515!1m5!1m1!1s0x87181246af22e765:0x7f5a90170c5df1b4!2m2!1d-106.287162!2d35.8440582
The same URI as a browser thumbnail surrogate:
The same URI as a social card surrogate:
150. @shawnmjones @StormyArchives
Social media storytelling uses surrogates to provide a “summary of summaries”
2 resources are shown in this Wakelet story
6 resources are shown in this Storify story
Each surrogate summarizes a web resource.
Each story groups the surrogates, summarizing the topic.
We want to use this technique to summarize web archive collections because users are already familiar with this visualization paradigm.