Summarizing archival collections
using storytelling techniques
Yasmin AlNoamany
Michele C. Weigle
Michael L. Nelson
Old Dominion University
Web Science & Digital Libraries Research Group
www.cs.odu.edu/~mln/
@phonedude_mln
Research Funded by IMLS LG-71-15-0077-15
Dodging the Memory Hole
Los Angeles, CA, 2016-10-14
Archive-It, a subscription-based service,
allows creation of collections
2
> 3,000
collections
> 340
institutions
> 10B archived
pages
3
Collection
title
Collection
categorization
based on the
curator
Seed URI
Metadata
about the
collection
Text search
box
The group that
the resource
belongs to
List of
the seed
URIs
Timespan of the
resource
and the number
of times it has
been captured
Collection understanding and collection
summarization are not currently supported
Not easy to answer “what’s in that collection?” or
“how is this collection different from others”?
4
There is more than one collection about
“Egyptian Revolution”
5
• “2010-2011 Arab Spring” https://archive-it.org/collections/3101
• “North Africa & the Middle East 2011-2013” https://archive-it.org/collections/2349
• “Egypt Revolution and Politics” https://archive-it.org/collections/2358
6
One of at least seven Human Rights collections…
7
8
Our early attempts at collection understanding
tried to include everything…
9
“Visualizing digital collections at Archive-It”, JCDL 2012.
http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
1000s of seeds X 1000s of archived pages ==
Conventional Vis Methods Not Applicable
10
Idea:
Storytelling
11
Stories in literature
Story elements: setting, characters, sequence, exposition, conflict,
climax, resolution
Once upon a time
http://www.learner.org/interactives/story/
12
Stories in social media
“It's hard to define a story, but I know it when I see it” (Alexander, 2008)
basically, just arranging web pages in time
13
“Storytelling” is becoming a popular
technique in social media
14
What are the limitations of
storytelling services?
15
The Egyptian Revolution on Storify
16
Bookmarking, not preserving!
17
Despite these limitations, how do we
combine storytelling & archives?
18
Use an interface people already know how to use
to summarize collections
Archived collections
Storytelling services
Archived enriched
stories
19
We sample k mementos from N pages of the
collection (k << N) to create a summary story
[Figure: summary stories (S1, S2, S3, ...) sampled from Collection X, Collection Y, and Collection Z]
20
Yasmin hand-crafted stories to summarize the
Egyptian Revolution collection for her son, Yousof
https://storify.com/yasmina_anwar/the-egyptian-revolution-on-archive-it-collection
https://storify.com/yasmina_anwar/the-story-of-the-egyptian-revolution-from-archive-
21
How do we generate this automatically?
22
Collections have two dimensions:
{Fixed, Sliding} X {Page, Time}
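The cross product above yields four kinds of story. A minimal sketch that enumerates them; the names `PAGE_MODES`, `TIME_MODES`, and `story_types` are illustrative and not from the talk:

```python
from itertools import product

# Each dimension can be held fixed or allowed to slide.
PAGE_MODES = ("Fixed", "Sliding")
TIME_MODES = ("Fixed", "Sliding")


def story_types():
    """Return every (page, time) combination as a readable label."""
    return [f"{page} Page, {time} Time"
            for page, time in product(PAGE_MODES, TIME_MODES)]


if __name__ == "__main__":
    for label in story_types():
        print(label)
```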
[Figure: grid of archived pages, with URIs on one axis and capture times t1 ... tk on the other]
23
Fixed Page, Fixed Time
A desktop Chrome user-agent
http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
Android Chrome user-agent
http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2
Schneider and McCown, “First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites”, JCDL 2013.
Kelly et al., “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine 2013.
24
Feb 1 Feb 1 Feb 2
Feb 4 Feb 5 Feb 7
Feb 9 Feb 11 Feb 11
25
Fixed Page, Sliding Time
Feb. 11, 2011
Mubarak resigns
26
Sliding Page, Fixed Time
Jan 27 Jan 31
Feb 7 Feb 4
Feb 11 Feb 11
Feb 2
Jan 25
Feb 10
27
Sliding Page, Sliding Time
The Dark and Stormy Archives (DSA) framework
Establish a baseline
• Characteristics of human-generated stories
• Characteristics of Archive-It collections
Reduce the candidate pool of archived pages
• Exclude duplicates
• Exclude off-topic pages
• Exclude non-English language pages
Select good representative pages
• Dynamically slice the collection
• Cluster the pages in each slice
• Select high-quality pages from each cluster
• Order pages by time
• Visualize
28
https://pbs.twimg.com/media/BQcpj7ACMAAHRp4.jpg
Establish a baseline of
social media stories
“Characteristics of Social Media Stories”, TPDL 2015, IJDL 2016.
29
What is the length of a story
(the number of resources per story)?
This story has
31 resources
30
What are the types of resources that
compose a story?
Quotes
Video
31
This story has
• 19 quotes
• 8 images
• 4 videos
What are the most frequently used domains?
Twitter.com
Twitter.com
Twitter.com
32
This story has
• 90% twitter.com
• 7% instagram.com
• 3% facebook.com
The top 25 domains represent 92%
of all domains
33
What differentiates a popular story?
(popular = stories with the top 25% of views)
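A minimal sketch of the top-25% split, assuming a mapping from story ID to view count. The function name and the choice of quantile method (Python's default "exclusive" method) are assumptions; the slides do not say how the cutoff was computed.

```python
from statistics import quantiles


def split_by_popularity(views_by_story):
    """Split stories into (popular, unpopular) sets by view count.

    A story counts as popular when its views exceed the third quartile,
    i.e. it sits in the top 25% of the distribution.
    """
    # quantiles(..., n=4) returns the three quartile cut points.
    cutoff = quantiles(views_by_story.values(), n=4)[2]
    popular = {s for s, v in views_by_story.items() if v > cutoff}
    unpopular = set(views_by_story) - popular
    return popular, unpopular


if __name__ == "__main__":
    sample = {f"s{i}": i for i in range(1, 9)}  # made-up view counts
    print(split_by_popularity(sample))
```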
19,795 views 64 views
34
The distributions for the features of the stories
• Based on the Kruskal-Wallis test at the p ≤ 0.05 significance level, popular and
unpopular stories differ on most of the features
• Popular stories tend to have:
• more web elements (medians of 28 vs. 21)
• a longer timespan (5 hours vs. 2 hours) than unpopular stories
35
Do popular stories have a lower decay rate?
The 75th percentile of decay rate per popular story is 10% of the resources,
while it is 15% in the unpopular stories
36
We found that 28 mementos is a good
number for the resources in the stories.
37
Establish a baseline of current
Archive-It collections
"Characteristics of Social Media Stories. What makes a good story?", International Journal on Digital Libraries 2016.
38
The mean and median number of
URIs in a collection
This collection has 435 seed URIs
39
The mean and median number of
mementos per URI
This seed URI has 16 mementos
40
The most frequently used domains
abcnews.go.com
blogspot.com
This collection has 30% abcnews.com, 10% blogspot.com, 3% facebook.com
41
Archive-It top 25 is fundamentally
different than Storify top 25
42
Archive-It top 25 is fundamentally
different than Storify top 25
43
Twitter
is #10
not #1
What we archive and
what we share on social media
are different subsets of the web
(seeds != shares)
44
see also: Brunelle, et al., “The impact of JavaScript on archivability”, IJDL 2015
Detecting off-topic pages
“Detecting Off-Topic Pages in Web Archives”, TPDL 2015, IJDL 2016.
45
Archive-It provides its partners with tools
that allow them to build themed collections
46
Archive-It tools are about HTTP events /
mechanics, not “content”
47
These tools won’t detect that > 60% of
mementos of hamdeensabahy.com are off-topic
May 13, 2012: The page started as on-topic.
May 24, 2012: Off-topic due to a database error.
Mar. 21, 2013: Not working because of financial problems.
May 21, 2013: On-topic again.
June 5, 2014: The site has been hacked.
Oct. 10, 2014: The domain has expired.
http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
48
How do we automatically detect
off-topic pages?
49
Textual content
cosine similarity, intersection of the most frequent terms,
Jaccard similarity
Method Similarity
cosine 0.7
TF-Intersection 0.6
Jaccard 0.5
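The three textual measures can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses raw term counts (no TF-IDF weighting or stop-word removal), and the tokenizer, the `top_n` default, and the normalization of the term intersection are all assumptions.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between the two term-count vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Shared distinct terms divided by all distinct terms."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def tf_intersection(a, b, top_n=20):
    """Fraction of a's most frequent terms that are also frequent in b."""
    top_a = {t for t, _ in Counter(tokens(a)).most_common(top_n)}
    top_b = {t for t, _ in Counter(tokens(b)).most_common(top_n)}
    return len(top_a & top_b) / len(top_a) if top_a else 0.0
```

In this sketch, each later memento of a page would be compared against an earlier on-topic memento of the same page; a score near zero suggests the content has drifted.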
50
Textual content
cosine similarity, intersection of the most frequent terms,
Jaccard similarity
Method Similarity
cosine 0.7
TF-Intersection 0.6
Jaccard 0.5
Method Similarity
cosine 0.0
TF-Intersection 0.0
Jaccard 0.0
51
Semantics of the text
Web based kernel function using the search engine (SE)
52
Sahami and Heilman, A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets, WWW 2006
Semantics of the text
Web based kernel function using the search engine (SE)
Method Similarity
SE-Kernel 0.7
53
Sahami and Heilman, A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets, WWW 2006
Structural methods
no. of words, content-length
100 109
Method % change
WordCount 0.09
54
Structural methods
no. of words, content-length
100 109
100 5
Method % change
WordCount 0.09
Method % change
WordCount -0.95
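The "% change" values above are relative changes expressed as fractions. A minimal sketch of the arithmetic, assuming the first count is the baseline memento and the second is the memento being checked (the function name is illustrative):

```python
def relative_change(baseline_count, later_count):
    """Relative change in word count from the baseline to a later memento."""
    if baseline_count == 0:
        raise ValueError("baseline memento has no words to compare against")
    return (later_count - baseline_count) / baseline_count


if __name__ == "__main__":
    print(round(relative_change(100, 109), 2))  # small growth
    print(round(relative_change(100, 5), 2))    # page lost most of its text
```

The same calculation applies to content length in bytes.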
55
We built a gold standard data set to
evaluate the methods
56
We manually labeled 15,760 mementos
Egypt Revolution and Politics
URI-Rs: 136
URI-Ms: 6,886
Off-topic URI-Ms: 384
Occupy Movement
URI-Rs: 255
URI-Ms: 6,570
Off-topic URI-Ms: 458
Columbia Univ. Human Rights collection
URI-Rs: 198
URI-Ms: 2,304
Off-topic URI-Ms: 94
57
Evaluated 6 methods + combos at 21 thresholds
Averaged the results at each threshold over the three gold standard
collections
Similarity Measure     Threshold      FP   FN   FP+FN  ACC    F1     AUC
(Cosine,WordCount)     (0.10,-0.85)   24   10   34     0.987  0.906  0.968
(Cosine,SEKernel)      (0.10,0.00)    6    35   40     0.990  0.901  0.934
Cosine                 0.15           31   22   53     0.983  0.881  0.961
(WordCount,SEKernel)   (-0.80,0.00)   14   27   42     0.985  0.818  0.885
WordCount              -0.85          6    44   50     0.982  0.806  0.870
SEKernel               0.05           64   83   147    0.965  0.683  0.865
Bytes                  -0.65          28   133  161    0.962  0.584  0.746
Jaccard                0.05           74   86   159    0.962  0.538  0.809
TF-Intersection        0.00           49   104  153    0.967  0.537  0.740
58
Average precision of 0.89 on 18 different
Archive-It collections
59
(Cosine,WordCount) with (0.10,-0.85) thresholds
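A sketch of how the combined (Cosine, WordCount) rule might be applied with these thresholds. The slides do not state how the two measures are combined; this example assumes a memento is flagged when either score falls below its threshold, and it takes precomputed scores as input.

```python
COSINE_THRESHOLD = 0.10       # from the slide
WORD_COUNT_THRESHOLD = -0.85  # from the slide


def is_off_topic(cosine, word_count_change):
    """Flag a memento as off-topic.

    Assumption: falling below either threshold is enough to flag it.
    """
    return cosine < COSINE_THRESHOLD or word_count_change < WORD_COUNT_THRESHOLD


if __name__ == "__main__":
    print(is_off_topic(0.7, 0.09))   # similar text, similar length
    print(is_off_topic(0.0, -0.95))  # unrelated text, page nearly empty
```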
How do we dynamically divide the
collections into appropriate slices?
(in other words, how do we pick just 28?)
60
We expected most collections to look like this…
The Global Food Crisis collection at Archive-It
61
This is what we found
Egypt Revolution and Politics
Human Rights
April 16 Archive Virginia Tech Shooting
Jasmine Revolution 2011 Wikileaks Document Release
62
Selecting representative pages for
generating stories
(skipping clustering details, but goal is k=28)
63
Quality metrics for selecting mementos
• In the DSA, memento quality Mq is calculated as
follows:
Mq = (1 − wm*Dm) + wql*Sql + wqc*Sqc
• Dm is the memento damage (Brunelle, JCDL 2014)
• Sql is the snippet quality based on the URI level
• Sqc is the snippet quality based on URI category
• wm, wql, wqc are the weights of memento damage, level,
and category
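The formula translates directly into code. The weights in the usage example are made up for illustration; the slides do not give the values used in the DSA.

```python
def memento_quality(damage, level_quality, category_quality,
                    w_m, w_ql, w_qc):
    """Mq = (1 - wm*Dm) + wql*Sql + wqc*Sqc, as written on the slide."""
    return (1 - w_m * damage) + w_ql * level_quality + w_qc * category_quality


def best_memento(candidates, w_m, w_ql, w_qc):
    """Pick the candidate with the highest Mq.

    candidates: iterable of (uri, Dm, Sql, Sqc) tuples.
    """
    return max(
        candidates,
        key=lambda c: memento_quality(c[1], c[2], c[3], w_m, w_ql, w_qc),
    )[0]


if __name__ == "__main__":
    # Hypothetical weights and scores, for illustration only.
    print(memento_quality(0.2, 1.0, 0.5, w_m=0.5, w_ql=0.3, w_qc=0.2))
```

Higher damage lowers the score, while better snippet scores raise it, so within a cluster the memento with the highest Mq would be the one selected.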
64
We prefer a higher quality memento (Dm)
http://wayback.archive-it.org/2358/20110201231457/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
http://wayback.archive-it.org/2358/20110201231622/http://www.bbc.co.uk/news/world/middle_east/
65
Brunelle et al. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources, JCDL 2014
We prefer pages with attractive snippets
https://wayback.archive-it.org/2358/20110207193404/http://news.blogs.cnn.com/2011/02/07/egypt-crisis-country-to-auction-treasury-bills/
https://wayback.archive-it.org/2358/20110207194425/http://www.cnn.com/2011/WORLD/africa/02/07/egypt.google.executive/index.html?hpt=T1
66
We prefer deep links over
high level domains (Sql)
Feb. 11, 2011: the homepage of BBC on Storify
Feb. 11, 2011: the homepage of BBC Middle East section on Storify
Feb. 11, 2011: the article of BBC on Storify
https://wayback.archive-it.org/2358/20110211191429/http://www.bbc.co.uk/
https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045
https://wayback.archive-it.org/2358/20110211191942/http://www.bbc.co.uk/news/world/middle_east/
67
Social media pages may not produce
good snippets (Sqc)
http://wayback.archive-it.org/1784/20100131023240/http:/twitter.com/Haitifeed/
http://wayback.archive-it.org/2358/20141225080305/https:/www.facebook.com/elshaheeed.co.uk
68
Visualizing stories in Storify
69
Remember Yasmin’s hand-crafted stories?
70
Remember Yasmin’s hand-crafted stories?
71
We extract the metadata of the pages
and order them chronologically
{ "elements":[
  {
    "permalink":"http://wayback.archive-it.org/694/20070523182134/http://www.usatoday.com/news/nation/2007-04-16-virginia-tech_N.htm",
    "type":"link",
    "source":{"href":"http://www.usatoday.com",
              "name":"www.usatoday.com @ 23, May 2007"}
  },
  {
    "permalink":"http://wayback.archive-it.org/694/20070530182159/http://www.time.com/time/specials/2007/vatech_victims",
    "type":"link",
    "source":{"href":"http://www.time.com",
              "name":"www.time.com @ 30, May 2007"}
  },
  {
    "permalink":"http://wayback.archive-it.org/694/20070530182206/http://www.collegiatetimes.com/",
    "type":"link",
    "source":{"href":"http://www.collegiatetimes.com",
              "name":"www.collegiatetimes.com @ 30, May 2007"}
  },
  {
    "permalink":"http://wayback.archive-it.org/694/20070606234248/http://hokies416.wordpress.com/",
    "type":"link",
    "source":{"href":"http://hokies416.wordpress.com",
              "name":"hokies416.wordpress.com @ 06, Jun 2007"}
  },
  …
  {
    "permalink":"http://wayback.archive-it.org/694/20070620234329/http://www.hokiesports.com/april16/",
    "type":"link",
    "source":{"href":"http://www.hokiesports.com",
              "name":"www.hokiesports.com @ 20, Jun 2007"}
  }
 ],
 "description":"This is an automatically generated story from Archive-It collection.",
 "title":"April 16 Archive"
}
72
Using the Storify API, we
override the default metadata
to generate more attractive
snippets
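A sketch of how the JSON shown on the previous slide could be assembled from a list of memento URIs. It assumes Wayback-style URIs with a 14-digit capture timestamp followed by the original URI; `build_story` is an illustrative name, and the sketch does not call the Storify API.

```python
import json
import re
from datetime import datetime
from urllib.parse import urlsplit

# .../<14-digit timestamp>/<original URI>
MEMENTO_RE = re.compile(r"/(\d{14})/(https?://.+)$")


def build_story(memento_uris, title, description):
    """Return a story dict with elements ordered by capture time."""
    parsed = []
    for uri in memento_uris:
        match = MEMENTO_RE.search(uri)
        if match is None:
            continue  # skip URIs that do not follow the assumed pattern
        captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
        host = urlsplit(match.group(2)).netloc
        parsed.append((captured, uri, host))
    parsed.sort(key=lambda item: item[0])  # chronological order
    elements = [
        {
            "permalink": uri,
            "type": "link",
            "source": {
                "href": f"http://{host}",
                "name": f"{host} @ {captured:%d, %b %Y}",
            },
        }
        for captured, uri, host in parsed
    ]
    return {"elements": elements, "description": description, "title": title}


if __name__ == "__main__":
    story = build_story(
        [
            "http://wayback.archive-it.org/694/20070530182206/http://www.collegiatetimes.com/",
            "http://wayback.archive-it.org/694/20070523182134/http://www.usatoday.com/news/nation/2007-04-16-virginia-tech_N.htm",
        ],
        title="April 16 Archive",
        description="This is an automatically generated story from Archive-It collection.",
    )
    print(json.dumps(story, indent=2))
```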
Example of an automatically generated story
73
Notice the good
metadata: images,
titles with dates,
favicons
Evaluating the Dark and Stormy
Archive framework
(how good are the automatically generated stories?)
74
Evaluation is tricky!
(two perfectly good stories could have non-overlapping k=28
elements!)
• We use human evaluators (via Amazon's
Mechanical Turk) to compare:
• Human-generated stories
• DSA (automatically) generated stories
• Randomly generated stories
• Successful evaluation means:
• Human and DSA stories are indistinguishable
• Human and DSA stories are better than Random
75
Our guidelines for expert archivists at Archive-It
for generating stories from the collections
76
We received 23 stories for 10
Archive-It collections
SPST is “Sliding Page, Sliding Time”
SPFT is “Sliding Page, Fixed Time”
FPST is “Fixed Page, Sliding Time”
77
https://storify.com/mturk_exp/3649b1s-57218803f5db94d11030f90b
78
• Generated by domain experts
• Sliding Page, Sliding Time
• The Boston Marathon
Bombing collection
Automatically generated stories from
archived collections
1. Obtain the seed list and the TimeMap of URIs from the front-end
interface of Archive-It
2. Extract the HTML of the mementos from the WARC files (locally
hosted at ODU) and download the collections that we do not have in
the ODU mirror from Archive-It
3. Extract the text of the page using the Boilerpipe library
4. Eliminate the off-topic pages based on the best-performing method
((Cosine, WordCount) with the suggested thresholds (0.10, −0.85))
5. Exclude duplicates in each TimeMap
6. Eliminate the non-English language pages
7. Slice the collection dynamically and then cluster the mementos of
each slice using the DBSCAN algorithm
8. Apply the quality metrics to select the best representative pages
9. Sort the selected mementos chronologically then put them and their
metadata in a JSON object
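As an illustration of step 5, here is one simple way to drop duplicates within a TimeMap. The slides do not say how duplicates are detected; this sketch assumes two mementos are duplicates when their extracted text is identical after whitespace normalization, and `exclude_duplicates` is an illustrative name.

```python
import hashlib


def exclude_duplicates(timemap):
    """Keep the earliest memento for each distinct page text.

    timemap: list of (timestamp, extracted_text) pairs, already sorted
    by timestamp.
    """
    seen = set()
    kept = []
    for timestamp, text in timemap:
        normalized = " ".join(text.split())  # collapse whitespace
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same text as an earlier capture
        seen.add(digest)
        kept.append((timestamp, text))
    return kept


if __name__ == "__main__":
    captures = [
        ("20110201", "Protests continue in Cairo"),
        ("20110202", "Protests  continue in Cairo"),
        ("20110211", "Mubarak resigns"),
    ]
    print(exclude_duplicates(captures))
```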
79
https://storify.com/mturk_exp/3649b0s
80
• Automatically generated story
• Sliding Page, Sliding Time
• The Boston Marathon
Bombing collection
Random stories
28 mementos were randomly selected from each
collection before excluding off-topic and duplicate
pages
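The random baseline amounts to a uniform sample without replacement. A minimal sketch, with an optional seed for repeatability (the function name and the seed parameter are assumptions):

```python
import random


def random_story(mementos, k=28, seed=None):
    """Pick up to k distinct mementos uniformly at random."""
    pool = list(mementos)
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


if __name__ == "__main__":
    collection = [f"memento-{i}" for i in range(500)]  # placeholder IDs
    print(len(random_story(collection, seed=1)))
```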
81
https://storify.com/mturk_exp/3649b2s-57227227bb79048c2d0388dc
82
• Randomly generated story
• Sliding Page, Sliding Time
• The Boston Marathon
Bombing collection
https://storify.com/mturk_exp/3649bads
83
if someone prefers this story,
we exclude their results
• Poorly generated story
• The same memento, 28 times
• The Boston Marathon
Bombing collection
MT experiment setup
• Three HITs for each story (69 HITs to evaluate 23 stories); two
comparisons per HIT:
• HIT1: human vs. automatic, human vs. poor
• HIT2: human vs. random, human vs. poor
• HIT3: random vs. automatic, automatic vs. poor
• 15 distinct turkers with master qualification (i.e., high acceptance rate)
for each HIT
• We rejected submissions that preferred the poorly generated story and
HITs that were completed in less than 10 seconds (mean time per
HIT = 7 minutes)
• 989 out of 1,035 (69*15) valid HITs
• We awarded the turker $0.50 per HIT
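The counts and rejection rules above can be sketched as follows. The `Submission` record and its field names are hypothetical; only the numbers and the two rejection rules come from the slide.

```python
from dataclasses import dataclass

MIN_SECONDS = 10  # submissions faster than this were rejected


@dataclass
class Submission:
    seconds_spent: float
    chose_poor_story: bool  # True if the turker preferred the poor story


def is_valid(submission):
    """Apply the two rejection rules from the experiment setup."""
    return (not submission.chose_poor_story
            and submission.seconds_spent >= MIN_SECONDS)


def total_submissions(stories=23, hits_per_story=3, turkers_per_hit=15):
    """Number of submissions collected across all HITs."""
    return stories * hits_per_story * turkers_per_hit


if __name__ == "__main__":
    total = total_submissions()
    print(total, f"{989 / total:.1%} valid")
```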
84
https://www.mturk.com/mturk/help?helpPage=worker#what_is_master_worker
A sample HIT
85
DSA == Human
(Human,DSA) > Random
86
Automatic versus Human
87
Sliding Page, Sliding Time Sliding Page, Fixed Time Fixed Page, Sliding Time
Human versus Random
Sliding Page, Sliding Time Sliding Page, Fixed Time Fixed Page, Sliding Time
88
Automatic versus Random
89
Sliding Page, Sliding Time Sliding Page, Fixed Time Fixed Page, Sliding Time
Success!
DSA-generated stories are just as good as stories
generated by human experts
90
Use an interface people already know how to use
to summarize collections
Archived collections
Storytelling services
Archived enriched
stories
91
All the code, datasets,
papers, slides, etc.:
http://bit.ly/YasminPhD

More Related Content

What's hot

iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...Justin Brunelle
 
Improving Understanding of Web Archive Collections Through Storytelling - PhD...
Improving Understanding of Web Archive Collections Through Storytelling - PhD...Improving Understanding of Web Archive Collections Through Storytelling - PhD...
Improving Understanding of Web Archive Collections Through Storytelling - PhD...Shawn Jones
 
The Many Shapes of Archive-It
The Many Shapes of Archive-ItThe Many Shapes of Archive-It
The Many Shapes of Archive-ItShawn Jones
 
Social Cards Probably Provide For Better Understanding Of Web Archive Collect...
Social Cards Probably Provide For Better Understanding Of Web Archive Collect...Social Cards Probably Provide For Better Understanding Of Web Archive Collect...
Social Cards Probably Provide For Better Understanding Of Web Archive Collect...Shawn Jones
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better Michael Nelson
 
Improving Collection Understanding in Web Archives
Improving Collection Understanding in Web ArchivesImproving Collection Understanding in Web Archives
Improving Collection Understanding in Web ArchivesShawn Jones
 
Where Can We Post Stories Summarizing Web Archive Collections
Where Can We Post Stories Summarizing Web Archive CollectionsWhere Can We Post Stories Summarizing Web Archive Collections
Where Can We Post Stories Summarizing Web Archive CollectionsShawn Jones
 
Characteristics of Social Media Stories
Characteristics of Social Media StoriesCharacteristics of Social Media Stories
Characteristics of Social Media StoriesYasmin AlNoamany, PhD
 
Combining Social Media Storytelling With Web Archives
Combining Social Media Storytelling With Web ArchivesCombining Social Media Storytelling With Web Archives
Combining Social Media Storytelling With Web ArchivesShawn Jones
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptMichael Nelson
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich  the Live Web Experience Through StorytellingUsing Web Archives to Enrich  the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingYasmin AlNoamany, PhD
 
Information Visualization - Visualizing Digital Collections at Archive-It
Information Visualization - Visualizing Digital Collections at Archive-ItInformation Visualization - Visualizing Digital Collections at Archive-It
Information Visualization - Visualizing Digital Collections at Archive-ItMichele Weigle
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Michael Nelson
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSawood Alam
 
Telling Stories with Web Archives
Telling Stories with Web ArchivesTelling Stories with Web Archives
Telling Stories with Web ArchivesMichele Weigle
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesSawood Alam
 
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...Alexander Nwala
 
Open the Door, Let \'em In: Virtual School Libraries
Open the Door, Let \'em In: Virtual School LibrariesOpen the Door, Let \'em In: Virtual School Libraries
Open the Door, Let \'em In: Virtual School LibrariesJoyce Kasman Valenza
 
Wikipedia: Why? Who? and How?
Wikipedia: Why? Who? and How?Wikipedia: Why? Who? and How?
Wikipedia: Why? Who? and How?Don Boozer
 

What's hot (20)

iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling App...
 
Improving Understanding of Web Archive Collections Through Storytelling - PhD...
Improving Understanding of Web Archive Collections Through Storytelling - PhD...Improving Understanding of Web Archive Collections Through Storytelling - PhD...
Improving Understanding of Web Archive Collections Through Storytelling - PhD...
 
The Many Shapes of Archive-It
The Many Shapes of Archive-ItThe Many Shapes of Archive-It
The Many Shapes of Archive-It
 
csvconfyasmin2017_05_03
csvconfyasmin2017_05_03csvconfyasmin2017_05_03
csvconfyasmin2017_05_03
 
Social Cards Probably Provide For Better Understanding Of Web Archive Collect...
Social Cards Probably Provide For Better Understanding Of Web Archive Collect...Social Cards Probably Provide For Better Understanding Of Web Archive Collect...
Social Cards Probably Provide For Better Understanding Of Web Archive Collect...
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better
 
Improving Collection Understanding in Web Archives
Improving Collection Understanding in Web ArchivesImproving Collection Understanding in Web Archives
Improving Collection Understanding in Web Archives
 
Where Can We Post Stories Summarizing Web Archive Collections
Where Can We Post Stories Summarizing Web Archive CollectionsWhere Can We Post Stories Summarizing Web Archive Collections
Where Can We Post Stories Summarizing Web Archive Collections
 
Characteristics of Social Media Stories
Characteristics of Social Media StoriesCharacteristics of Social Media Stories
Characteristics of Social Media Stories
 
Combining Social Media Storytelling With Web Archives
Combining Social Media Storytelling With Web ArchivesCombining Social Media Storytelling With Web Archives
Combining Social Media Storytelling With Web Archives
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich  the Live Web Experience Through StorytellingUsing Web Archives to Enrich  the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through Storytelling
 
Information Visualization - Visualizing Digital Collections at Archive-It
Information Visualization - Visualizing Digital Collections at Archive-ItInformation Visualization - Visualizing Digital Collections at Archive-It
Information Visualization - Visualizing Digital Collections at Archive-It
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
 
Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMap
 
Telling Stories with Web Archives
Telling Stories with Web ArchivesTelling Stories with Web Archives
Telling Stories with Web Archives
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web Archives
 
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...Bootstrapping Web Archive Collections  of Stories from Micro-collections in S...
Bootstrapping Web Archive Collections of Stories from Micro-collections in S...
 
Open the Door, Let \'em In: Virtual School Libraries
Open the Door, Let \'em In: Virtual School LibrariesOpen the Door, Let \'em In: Virtual School Libraries
Open the Door, Let \'em In: Virtual School Libraries
 
Wikipedia: Why? Who? and How?
Wikipedia: Why? Who? and How?Wikipedia: Why? Who? and How?
Wikipedia: Why? Who? and How?
 

Viewers also liked

Robust Linking to Web Resources
Robust Linking to Web ResourcesRobust Linking to Web Resources
Robust Linking to Web ResourcesMartin Klein
 
We Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesWe Need Multiple, Independent Web Archives
We Need Multiple, Independent Web ArchivesMichael Nelson
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingUsing Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingYasmin AlNoamany, PhD
 
Software as a Well-Formed Research Object
Software as a Well-Formed Research ObjectSoftware as a Well-Formed Research Object
Software as a Well-Formed Research ObjectYasmin AlNoamany, PhD
 
When Should I Make Preservation Copies of Myself?
When Should I Make Preservation Copies of Myself?�When Should I Make Preservation Copies of Myself?�
When Should I Make Preservation Copies of Myself?Michael Nelson
 
Who and What Links to the Internet Archive
Who and What Links to the Internet ArchiveWho and What Links to the Internet Archive
Who and What Links to the Internet ArchiveMichael Nelson
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionSawood Alam
 
@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015@WebSciDL PhD Student Project Reviews August 5&6, 2015
@WebSciDL PhD Student Project Reviews August 5&6, 2015Michael Nelson
 
Profiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageProfiling Web Archive Coverage for Top-Level Domain and Content Language
Profiling Web Archive Coverage for Top-Level Domain and Content LanguageMichael Nelson
 
Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member Old Dominion University Computer Science IIPC New Member
Old Dominion University Computer Science IIPC New Member Michael Nelson
 
Assessing the Quality of Web Archives
Assessing the Quality of Web ArchivesAssessing the Quality of Web Archives
Assessing the Quality of Web ArchivesMichael Nelson
 
On the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over TimeOn the Change in Archivability of Websites Over Time
On the Change in Archivability of Websites Over TimeMichael Nelson
 
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingWho Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingMichael Nelson
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web ArchivesMichael Nelson
 
Evaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench ToolEvaluating the SiteStory Transactional Web Archive with the ApacheBench Tool
Evaluating the SiteStory Transactional Web Archive with the ApacheBench ToolMichael Nelson
 
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...
Resurrecting My Revolutionsing Social Link Neighborhood in Bringing Context t...Michael Nelson
 
Evaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesEvaluating the Temporal Coherence of Archived Pages
Evaluating the Temporal Coherence of Archived PagesMichael Nelson
 
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange Project
OAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange ProjectOAI-ORE:  The Open Archives Initiative  Object Reuse and Exchange Project
OAI-ORE: The Open Archives Initiative Object Reuse and Exchange ProjectMichael Nelson
 
Why Care About the Past?
Why Care About the Past?Why Care About the Past?
Why Care About the Past?Michael Nelson
 

Summarizing archival collections using storytelling techniques

  • 1. Summarizing archival collections using storytelling techniques Yasmin AlNoamany Michele C. Weigle Michael L. Nelson Old Dominion University Web Science & Digital Libraries Research Group www.cs.odu.edu/~mln/ @phonedude_mln Research Funded by IMLS LG-71-15-0077-15 Dodging the Memory Hole Los Angeles, CA, 2016-10-14
  • 2. Archive-It, a subscription-based service, allows creation of collections: > 3,000 collections, > 340 institutions, > 10B archived pages
  • 3. Collection title; collection categorization based on the curator; seed URI; metadata about the collection; text search box; the group that the resource belongs to; list of the seed URIs; timespan of the resource and the number of times it has been captured
  • 4. Collection understanding and collection summarization are not currently supported. It is not easy to answer “what’s in that collection?” or “how is this collection different from others?”
  • 5. There is more than one collection about the “Egyptian Revolution”: “2010-2011 Arab Spring” (https://archive-it.org/collections/3101), “North Africa & the Middle East 2011-2013” (https://archive-it.org/collections/2349), “Egypt Revolution and Politics” (https://archive-it.org/collections/2358)
  • 6. One of at least seven Human Rights collections…
  • 9. Our early attempts at collection understanding tried to include everything… “Visualizing digital collections at Archive-It”, JCDL 2012. http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
  • 10. 1000s of seeds X 1000s of archived pages == Conventional Vis Methods Not Applicable
  • 12. Stories in literature. Story elements: setting, characters, sequence, exposition, conflict, climax, resolution. “Once upon a time” http://www.learner.org/interactives/story/
  • 13. Stories in social media: “It's hard to define a story, but I know it when I see it” (Alexander, 2008); basically, just arranging web pages in time
  • 14. “Storytelling” is becoming a popular technique in social media
  • 15. What are the limitations of storytelling services?
  • 16. The Egyptian Revolution on Storify
  • 18. Despite these limitations, how do we combine storytelling & archives?
  • 19. Use an interface people already know how to use to summarize collections. Archived collections; storytelling services; archived enriched stories
  • 20. We sample k mementos from N pages of the collection (k << N) to create a summary story. [Figure: summary stories (S1, S2, S3, …) drawn from Collection X, Collection Y, and Collection Z]
  • 21. Yasmin hand-crafted stories to summarize the Egyptian Revolution collection for her son, Yousof: https://storify.com/yasmina_anwar/the-egyptian-revolution-on-archive-it-collection https://storify.com/yasmina_anwar/the-story-of-the-egyptian-revolution-from-archive-
  • 22. How do we generate this automatically?
  • 23. Collections have two dimensions: {Fixed, Sliding} X {Page, Time}. [Figure: URIs plotted against time, t1 … tk]
  • 24. Fixed Page, Fixed Time. A desktop Chrome user-agent: http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2 Android Chrome user-agent: http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2 Schneider and McCown, “First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites”, JCDL 2013. Kelly et al., “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine 2013.
  • 25. Fixed Page, Sliding Time: Feb 1, Feb 1, Feb 2, Feb 4, Feb 5, Feb 7, Feb 9, Feb 11, Feb 11
  • 26. Sliding Page, Fixed Time: Feb. 11, 2011, Mubarak resigns
  • 27. Sliding Page, Sliding Time: Jan 27, Jan 31, Feb 7, Feb 4, Feb 11, Feb 11, Feb 2, Jan 25, Feb 10
  • 28. The Dark and Stormy Archives (DSA) framework. Establish a baseline: characteristics of human-generated stories; characteristics of Archive-It collections. Reduce the candidate pool of archived pages: exclude duplicates; exclude off-topic pages; exclude non-English language pages. Select good representative pages: dynamically slice the collection; cluster the pages in each slice; select high-quality pages from each cluster; order pages by time; visualize. https://pbs.twimg.com/media/BQcpj7ACMAAHRp4.jpg
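The "reduce the candidate pool" and "select good representative pages" stages named on slide 28 can be sketched roughly as below. This is a minimal illustration, not the DSA implementation: the `Memento` fields (`on_topic`, `language`, `quality`), the `summarize` function, exact-text duplicate detection, and equal-size time slices are all assumptions made for the example, and the clustering step is skipped by taking one pick per slice.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Memento:
    """Hypothetical record for one archived page (all fields are assumptions)."""
    uri: str
    captured: datetime
    text: str
    language: str = "en"
    on_topic: bool = True   # assumed to come from a separate off-topic detector
    quality: float = 0.0    # hypothetical quality score, higher is better


def summarize(mementos, k):
    """Return at most k mementos, ordered by capture time."""
    # Reduce the candidate pool: drop duplicates, off-topic, and non-English pages.
    seen_texts = set()
    pool = []
    for m in sorted(mementos, key=lambda m: m.captured):
        if m.text in seen_texts or not m.on_topic or m.language != "en":
            continue
        seen_texts.add(m.text)
        pool.append(m)
    if not pool or k <= 0:
        return []

    # Slice the time-ordered pool into k contiguous, non-empty slices.
    k = min(k, len(pool))
    n = len(pool)
    slices = [pool[i * n // k:(i + 1) * n // k] for i in range(k)]

    # Select the highest-quality page in each slice, then order picks by time.
    picks = [max(s, key=lambda m: m.quality) for s in slices]
    return sorted(picks, key=lambda m: m.captured)
```

Calling `summarize(collection, 28)` would cap the story at 28 pages, with each pick coming from a different period of the collection's lifetime.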
  • 29. Establish a baseline of social media stories. “Characteristics of Social Media Stories”, TPDL 2015, IJDL 2016.
  • 30. What is the length of a story (the number of resources per story)? This story has 31 resources
  • 31. What are the types of resources that compose a story? This story has 19 quotes, 8 images, and 4 videos
  • 32. What are the most frequently used domains? This story has 90% twitter.com, 7% instagram.com, and 3% facebook.com
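A per-story domain breakdown like the one on slide 32 can be computed from the list of resource URIs. The sketch below is only an illustration: the function name `domain_shares` and the choice to fold a leading `www.` into the bare host are assumptions, not something the slides specify.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_shares(uris):
    """Return {domain: percentage of URIs}, largest share first."""
    hosts = []
    for uri in uris:
        host = urlparse(uri).netloc.lower()
        # Assumption: treat "www.twitter.com" and "twitter.com" as one domain.
        if host.startswith("www."):
            host = host[4:]
        hosts.append(host)
    counts = Counter(hosts)
    total = sum(counts.values())
    return {host: 100.0 * n / total for host, n in counts.most_common()}
```

Applied to the URIs of every story (or every seed of a collection), the same counts give the "top 25 domains" rankings compared later in the deck.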
  • 33. The top 25 domains represent 92% of all domains
  • 34. What differentiates a popular story? (popular = stories with the top 25% of views) Examples: 19,795 views; 64 views
  • 35. The distributions for the features of the stories. Based on the Kruskal-Wallis test, at the p ≤ 0.05 significance level, the popular and the unpopular stories are different in terms of most of the features. Popular stories tend to have more web elements (medians of 28 vs. 21) and a longer timespan (5 hours vs. 2 hours) than the unpopular stories.
  • 36. Do popular stories have a lower decay rate? The 75th percentile of decay rate per popular story is 10% of the resources, while it is 15% in the unpopular stories
  • 37. We found that 28 mementos is a good number for the resources in the stories.
  • 38. Establish a baseline of current Archive-It collections. “Characteristics of Social Media Stories. What makes a good story?”, International Journal on Digital Libraries 2016.
  • 39. The mean and median number of URIs in a collection. This collection has 435 seed URIs
  • 40. The mean and median number of mementos per URI. This seed URI has 16 mementos
  • 41. The most frequently used domains. This collection has 30% abcnews.com, 10% blogspot.com, 3% facebook.com
  • 42. Archive-It top 25 is fundamentally different than Storify top 25
  • 43. Archive-It top 25 is fundamentally different than Storify top 25: Twitter is #10, not #1
  • 44. What we archive and what we share on social media are different subsets of the web (seeds != shares). See also: Brunelle, et al., “The impact of JavaScript on archivability”, IJDL 2015
  • 45. Detecting off-topic pages. “Detecting Off-Topic Pages in Web Archives”, TPDL 2015, IJDL 2016.
  • 46. Archive-It provides their partners with tools that allow them to build themed collections 46
  • 47. Archive-It tools are about HTTP events / mechanics, not “content” 47
  • 48. These tools won’t detect that > 60% of mementos of hamdeensabahy.com are off-topic. May 13, 2012: the page started as on-topic. May 24, 2012: off-topic due to a database error. Mar. 21, 2013: not working because of financial problems. May 21, 2013: on-topic again. June 5, 2014: the site has been hacked. Oct. 10, 2014: the domain has expired. http://wayback.archive-it.org/2358/*/http://hamdeensabahy.com
  • 49. How do we automatically detect off-topic pages? 49
  • 50. Textual content: cosine similarity, intersection of the most frequent terms, Jaccard similarity. Similarity by method: cosine 0.7; TF-Intersection 0.6; Jaccard 0.5.
  • 51. Textual content: cosine similarity, intersection of the most frequent terms, Jaccard similarity. Similarity by method for two similar pages: cosine 0.7; TF-Intersection 0.6; Jaccard 0.5. Similarity by method when one memento is off-topic: cosine 0.0; TF-Intersection 0.0; Jaccard 0.0.
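The three textual measures named on these slides can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes raw term frequencies (no TF-IDF weighting), a simple alphanumeric tokenizer, and a hypothetical cutoff `k` for the "most frequent terms" intersection.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the raw term-frequency vectors of two texts."""
    ta, tb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ta[t] * tb[t] for t in ta.keys() & tb.keys())
    norm = math.sqrt(sum(v * v for v in ta.values())) * math.sqrt(
        sum(v * v for v in tb.values())
    )
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def tf_intersection(a, b, k=20):
    """Share of the k most frequent terms of `a` that are also in the top k of `b`.

    The value of k is an assumption made for this sketch.
    """
    top_a = {t for t, _ in Counter(tokenize(a)).most_common(k)}
    top_b = {t for t, _ in Counter(tokenize(b)).most_common(k)}
    return len(top_a & top_b) / len(top_a) if top_a else 0.0
```

In the off-topic setting, `a` would be the text of the first memento of a URI and `b` the text of a later memento; a lost domain or an error page shares almost no terms with the original, so all three scores drop toward 0.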
  • 52. Semantics of the text Web based kernel function using the search engine (SE) 52 Sahami and Heilman, A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets, WWW 2006
  • 53. Semantics of the text: Web-based kernel function using the search engine (SE). Similarity by method: SE-Kernel 0.7. Sahami and Heilman, A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets, WWW 2006
  • 54. Structural methods: number of words, content length. Word count 100 → 109: WordCount change 0.09.
  • 55. Structural methods: number of words, content length. Word count 100 → 109: WordCount change 0.09. Word count 100 → 5: WordCount change -0.95.
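The two WordCount values on the slide are consistent with a relative change measured against the first memento: (109 - 100) / 100 = 0.09 and (5 - 100) / 100 = -0.95. A minimal sketch under that reading follows; the function names are hypothetical and whitespace tokenization is an assumption.

```python
def word_count(text):
    """Count whitespace-separated words in the extracted page text."""
    return len(text.split())


def relative_change(first_count, later_count):
    """Relative change of a count with respect to the first memento.

    Negative values mean the later memento shrank; -0.95 means it lost
    95% of its words.
    """
    if first_count == 0:
        raise ValueError("the first memento has no words to compare against")
    return (later_count - first_count) / first_count
```

The same function applies to content length in bytes by passing byte counts instead of word counts.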
  • 56. We built a gold standard data set to evaluate the methods 56
  • 57. We manually labeled 15,760 mementos. Egypt Revolution and Politics: URI-Rs: 136, URI-Ms: 6,886, off-topic URI-Ms: 384. Occupy Movement: URI-Rs: 255, URI-Ms: 6,570, off-topic URI-Ms: 458. Columbia Univ. Human Rights collection: URI-Rs: 198, URI-Ms: 2,304, off-topic URI-Ms: 94.
  • 58. Evaluated 6 methods + combos at 21 thresholds. Averaged the results at each threshold over the three gold standard collections.
      Similarity Measure     Threshold       FP   FN   FP+FN  ACC    F1     AUC
      (Cosine,WordCount)     (0.10,-0.85)    24   10   34     0.987  0.906  0.968
      (Cosine,SEKernel)      (0.10,0.00)     6    35   40     0.990  0.901  0.934
      Cosine                 0.15            31   22   53     0.983  0.881  0.961
      (WordCount,SEKernel)   (-0.80,0.00)    14   27   42     0.985  0.818  0.885
      WordCount              -0.85           6    44   50     0.982  0.806  0.870
      SEKernel               0.05            64   83   147    0.965  0.683  0.865
      Bytes                  -0.65           28   133  161    0.962  0.584  0.746
      Jaccard                0.05            74   86   159    0.962  0.538  0.809
      TF-Intersection        0.00            49   104  153    0.967  0.537  0.740
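The FP, FN, ACC, and F1 columns can be reproduced for any single method and threshold from gold-standard labels. The sketch below is an illustration only: it treats "off-topic" as the positive class, flags a memento when its similarity to the first memento falls below the threshold, and assumes the 21 thresholds are the values 0.00, 0.05, ..., 1.00 (the slide gives the count, not the grid).

```python
def evaluate(scores, labels, threshold):
    """Score one similarity method at one threshold.

    scores: similarity of each memento to the first memento of its URI.
    labels: True where the gold standard marks the memento off-topic.
    """
    tp = fp = fn = tn = 0
    for score, off_topic in zip(scores, labels):
        predicted_off = score < threshold
        if predicted_off and off_topic:
            tp += 1
        elif predicted_off:
            fp += 1  # flagged off-topic, but really on-topic
        elif off_topic:
            fn += 1  # missed an off-topic memento
        else:
            tn += 1
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"FP": fp, "FN": fn, "ACC": accuracy, "F1": f1}


# Assumed threshold grid: 21 evenly spaced values from 0.00 to 1.00.
THRESHOLDS = [round(i * 0.05, 2) for i in range(21)]


def best_threshold(scores, labels):
    """Return the threshold from the assumed grid with the highest F1."""
    return max(THRESHOLDS, key=lambda t: evaluate(scores, labels, t)["F1"])
```

Because off-topic mementos are a small fraction of each collection, accuracy stays high for every row of the table, which is why F1 is the more informative column for ranking the methods.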
  • 59. Average precision of 0.89 on 18 different Archive-It collections 59 (Cosine,WordCount) with (0.10,-0.85) thresholds
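The best-performing combination can be written as a single predicate. The thresholds (0.10 and -0.85) come from the slides and the OR rule from the speaker notes; the function and parameter names are hypothetical, and both inputs are assumed to be measured against the first memento of the same URI.

```python
def is_off_topic(cosine_to_first, word_count_change,
                 cosine_threshold=0.10, word_count_threshold=-0.85):
    """Flag a memento as off-topic with the (Cosine, WordCount) rule.

    A memento is off-topic if its cosine similarity to the first memento is
    below 0.10 OR its word count dropped by more than 85%.
    """
    return (cosine_to_first < cosine_threshold
            or word_count_change < word_count_threshold)
```

With the example values from the earlier slides, a similar pair (cosine 0.7, word count change 0.09) stays on-topic, while a lost-domain page (cosine 0.0, word count change -0.95) is flagged.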
  • 60. How do we dynamically divide the collections into appropriate slices? (in other words, how do we pick just 28?) 60
  • 61. We expected most collections to look like this… The Global Food Crisis collection at Archive-It 61
  • 62. This is what we found: Egypt Revolution and Politics; Human Rights; April 16 Archive Virginia Tech Shooting; Jasmine Revolution 2011; Wikileaks Document Release
  • 63. Selecting representative pages for generating stories (skipping clustering details, but goal is k=28) 63
  • 64. Quality metrics for selecting mementos • In the DSA, memento quality Mq is calculated as follows: Mq = (1 − wm*Dm) + wql*Sql + wqc*Sqc • Dm is the memento damage (Brunelle, JCDL 2014) • Sql is the snippet quality based on the URI level • Sqc is the snippet quality based on URI category • wm, wql, wqc are the weights of memento damage, level, and category
  • 65. We prefer a higher quality memento (Dm) http://wayback.archive-it.org/2358/20110201231457/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/ http://wayback.archive-it.org/2358/20110201231622/http://www.bbc.co.uk/news/world/middle_east/ Brunelle et al., “Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014
  • 66. We prefer pages with attractive snippets https://wayback.archive-it.org/2358/20110207193404/http://news.blogs.cnn.com/2011/02/07/egypt-crisis-country-to-auction-treasury-bills/ https://wayback.archive-it.org/2358/20110207194425/http://www.cnn.com/2011/WORLD/africa/02/07/egypt.google.executive/index.html?hpt=T1
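The quality formula translates directly into code. The sketch below follows the slide's symbols term by term; the weights are passed in explicitly because the slides do not state their values, and the dictionary keys (`damage`, `level`, `category`) are hypothetical names for Dm, Sql, and Sqc.

```python
def memento_quality(damage, level_score, category_score, w_m, w_ql, w_qc):
    """Mq = (1 - wm*Dm) + wql*Sql + wqc*Sqc, as written on the slide."""
    return (1 - w_m * damage) + w_ql * level_score + w_qc * category_score


def select_best(cluster, w_m, w_ql, w_qc):
    """Pick the highest-quality memento from one cluster of candidates.

    Each candidate is a dict with the hypothetical keys
    'damage' (Dm), 'level' (Sql), and 'category' (Sqc).
    """
    return max(
        cluster,
        key=lambda m: memento_quality(
            m["damage"], m["level"], m["category"], w_m, w_ql, w_qc
        ),
    )
```

Under this formula a deep link to an article with little damage outranks a heavily damaged homepage, which matches the preferences stated on the following slides.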
  • 67. We prefer deep links over high level domains (Sql) Feb. 11, 2011: the homepage of BBC on Storify Feb. 11, 2011: the homepage of BBC Middle East section on Storify Feb. 11, 2011: the article of BBC on Storify https://wayback.archive-it.org/2358/20110211191429/http://www.bbc.co.uk/ https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045 https://wayback.archive-it.org/2358/20110211191942/http://www.bbc.co.uk/news/world/middle_east/ 67
  • 68. Social media pages may not produce good snippets (Sqc) http://wayback.archive-it.org/1784/20100131023240/http:/twitter.com/Haitifeed/ http://wayback.archive-it.org/2358/20141225080305/https:/www.facebook.com/elshaheeed.co.uk
  • 72. We extract the metadata of the pages and order them chronologically. Using the Storify API, we override the default metadata to generate more attractive snippets.
      {
        "elements": [
          { "permalink": "http://wayback.archive-it.org/694/20070523182134/http://www.usatoday.com/news/nation/2007-04-16-virginia-tech_N.htm",
            "type": "link",
            "source": { "href": "http://www.usatoday.com", "name": "www.usatoday.com @ 23, May 2007" } },
          { "permalink": "http://wayback.archive-it.org/694/20070530182159/http://www.time.com/time/specials/2007/vatech_victims",
            "type": "link",
            "source": { "href": "http://www.time.com", "name": "www.time.com @ 30, May 2007" } },
          { "permalink": "http://wayback.archive-it.org/694/20070530182206/http://www.collegiatetimes.com/",
            "type": "link",
            "source": { "href": "http://www.collegiatetimes.com", "name": "www.collegiatetimes.com @ 30, May 2007" } },
          { "permalink": "http://wayback.archive-it.org/694/20070606234248/http://hokies416.wordpress.com/",
            "type": "link",
            "source": { "href": "http://hokies416.wordpress.com", "name": "hokies416.wordpress.com @ 06, Jun 2007" } },
          …
          { "permalink": "http://wayback.archive-it.org/694/20070620234329/http://www.hokiesports.com/april16/",
            "type": "link",
            "source": { "href": "http://www.hokiesports.com", "name": "www.hokiesports.com @ 20, Jun 2007" } }
        ],
        "description": "This is an automatically generated story from Archive-It collection.",
        "title": "April 16 Archive"
      }
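A story object with this shape can be assembled from a list of memento URIs alone, since each Archive-It URI-M embeds a 14-digit capture timestamp followed by the original URI. The sketch below mirrors the field names shown on the slide; the helper names are hypothetical, it only builds the JSON object and does not call any Storify endpoint, and it assumes the embedded original URI has a well-formed `scheme://host` prefix.

```python
import json
import re
from datetime import datetime
from urllib.parse import urlsplit

# A URI-M looks like .../<collection>/<14-digit timestamp>/<original URI>
URI_M_PATTERN = re.compile(r"/(\d{14})/(.+)$")


def story_element(uri_m):
    """Return (capture datetime, story element) for one memento URI."""
    match = URI_M_PATTERN.search(uri_m)
    if match is None:
        raise ValueError(f"no 14-digit timestamp found in {uri_m!r}")
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    original = urlsplit(match.group(2))
    element = {
        "permalink": uri_m,
        "type": "link",
        "source": {
            "href": f"{original.scheme}://{original.netloc}",
            # Date in the title, e.g. "www.usatoday.com @ 23, May 2007"
            "name": f"{original.netloc} @ {captured.strftime('%d, %b %Y')}",
        },
    }
    return captured, element


def build_story(uri_ms, title, description):
    """Order the selected mementos by capture time and wrap them in a story."""
    dated = sorted((story_element(u) for u in uri_ms), key=lambda pair: pair[0])
    return {
        "elements": [element for _, element in dated],
        "description": description,
        "title": title,
    }
```

Calling `json.dumps(build_story(uri_ms, "April 16 Archive", "..."), indent=2)` would serialize the result for upload or inspection.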
  • 73. Example of an automatically generated story 73 Notice the good metadata: images, titles with dates, favicons
  • 74. Evaluating the Dark and Stormy Archive framework (how good are the automatically generated stories?) 74
  • 75. Evaluation is tricky! (two perfectly good stories could have non-overlapping k=28 elements!) • We use human evaluators (via Amazon's Mechanical Turk) to compare: • Human-generated stories • DSA (automatically) generated stories • Randomly generated stories • Successful evaluation means: • Human and DSA stories are indistinguishable • Human and DSA stories are better than Random 75
  • 76. Our guidelines for expert archivists at Archive-It for generating stories from the collections 76
  • 77. We received 23 stories for 10 Archive-It collections SPST is “Sliding Page, Sliding Time” SPFT is “Sliding Page, Fixed Time” FPST is “Fixed Page, Sliding Time” 77
  • 78. https://storify.com/mturk_exp/3649b1s-57218803f5db94d11030f90b 78 • Generated by domain experts • Sliding Page, Sliding Time • The Boston Marathon Bombing collection
  • 79. Automatically generated stories from archived collections
      1. Obtain the seed list and the TimeMap of URIs from the front-end interface of Archive-It
      2. Extract the HTML of the mementos from the WARC files (locally hosted at ODU) and download the collections that we do not have in the ODU mirror from Archive-It
      3. Extract the text of the page using the Boilerpipe library
      4. Eliminate the off-topic pages based on the best-performing method ((Cosine, WordCount) with the suggested thresholds (0.1, −0.85))
      5. Exclude duplicates in each TimeMap
      6. Eliminate the non-English language pages
      7. Slice the collection dynamically and then cluster the mementos of each slice using the DBSCAN algorithm
      8. Apply the quality metrics to select the best representative pages
      9. Sort the selected mementos chronologically, then put them and their metadata in a JSON object
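The candidate-reduction part of the steps above (steps 4 to 6, followed by the chronological sort) can be sketched as one filtering pass. This is an illustration under stated assumptions, not the DSA code: the `Memento` record is hypothetical, the off-topic flag and language are assumed to have been computed earlier, and duplicates are detected by hashing the extracted text per URI-R, which the slides do not specify. Slicing, DBSCAN clustering, and quality scoring are left out.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Memento:
    uri_r: str       # original (seed) URI
    uri_m: str       # archived copy
    timestamp: str   # 14-digit capture time, e.g. "20130415190000"
    text: str        # boilerplate-free page text
    language: str    # assumed to be detected in an earlier step
    off_topic: bool  # assumed output of the off-topic detector


def reduce_candidates(mementos):
    """Drop off-topic, non-English, and duplicate mementos, oldest first."""
    seen = set()
    kept = []
    for m in sorted(mementos, key=lambda m: m.timestamp):
        if m.off_topic or m.language != "en":
            continue
        # Duplicates are only excluded within the same TimeMap (same URI-R).
        key = (m.uri_r, hashlib.sha1(m.text.encode("utf-8")).hexdigest())
        if key in seen:
            continue
        seen.add(key)
        kept.append(m)
    return kept
```

Sorting before deduplicating keeps the earliest capture of each distinct page version, so the surviving memento marks when that content first appeared in the archive.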
  • 80. https://storify.com/mturk_exp/3649b0s 80 • Automatically generated story • Sliding Page, Sliding Time • The Boston Marathon Bombing collection
  • 81. Random stories 28 mementos were randomly selected from each collection before excluding off-topic and duplicate pages 81
  • 82. https://storify.com/mturk_exp/3649b2s-57227227bb79048c2d0388dc 82 • Randomly generated story • Sliding Page, Sliding Time • The Boston Marathon Bombing collection
  • 83. https://storify.com/mturk_exp/3649bads 83 if someone prefers this story, we exclude their results • Poorly generated story • The same memento, 28 times • The Boston Marathon Bombing collection
  • 84. MT experiment setup • Three HITs for each story (69 HITs to evaluate 23 stories); two comparisons per HIT: • HIT1: human vs. automatic, human vs. poor • HIT2: human vs. random, human vs. poor • HIT3: random vs. automatic, automatic vs. poor • 15 distinct turkers with master qualification (i.e., high acceptance rate) for each HIT • We rejected the submissions that preferred the poorly generated stories and the HITs that were completed in less than 10 seconds (mean time per HIT = 7 minutes) • 989 out of 1,035 (69*15) valid HITs • We awarded the turker $0.50 per HIT https://www.mturk.com/mturk/help?helpPage=worker#what_is_master_worker
  • 87. Automatic versus Human 87 Sliding Page, Sliding Time Sliding Page, Fixed Time Fixed Page, Sliding Time
  • 88. Human versus Random Sliding Page, Sliding Time Sliding Page, Fixed Time Fixed Page, Sliding Time 88
  • 89. Automatic versus Random 89 Sliding Page, Sliding Time Sliding Page, Fixed Time Fixed Page, Sliding Time
  • 90. Success! DSA-generated stories are just as good as stories generated by human experts 90
  • 91. Use an interface people already know how to use to summarize collections. Archived collections; storytelling services; archived enriched stories. All the code, datasets, papers, slides, etc.: http://bit.ly/YasminPhD

Editor's Notes

  1. First deployed in 2006, Archive-It is a subscription web archiving service from the Internet Archive that helps organizations to harvest, build, and preserve collections of digital content. 
  2. Lori created the collections and entered metadata about them: description, title, etc. This is collection-level metadata, but it doesn’t help a lot. Archive-It provides faceted browsing and search services on the resulting collection.
  3. There are about 3 or 4 collections about the Egyptian Revolution in Archive-It. If I want to know about the Egyptian Revolution, which collection should I browse? A collection has two dimensions: URIs, and copies of these URIs. A historian with more than one collection will not know where to start.
  4. Even with these visualizations, we still do not know what the content of these collections is. Users have to go through the mementos manually to understand the collection, so they have to inspect a lot of things by hand.
  5. We concluded that the conventional visualization methods, in which we try to visualize everything in the collection, are not applicable.
  6. So how about using storytelling?
  7. Every story is made up of a set of events. Stories in literature have elements such as setting, characters, sequence, etc. We use “story” in its current, loose context of social media, which is sometimes missing elements from the more formal literary tradition of dramatic structure, morality, humor, improvisation, etc. What we mean by storytelling here is using visualizations to put a set of web pages from web archives in a narrative structure, ordered by time.
  8. The definition of a story in social media is much looser and more relaxed: it is more about arranging resources through time. Storytelling may be seen as the set of cultural practices for representing events chronologically.
  9. Because of the sheer volume of information on the web, “storytelling” is becoming a popular technique in social media for selecting representative tweets, videos, web pages, etc. and arranging them in chronological order to support a particular narrative or “story”. We use “story” in its current, loose context of social media, which is sometimes missing elements from the more formal literary tradition of dramatic structure, morality, humor, improvisation, etc.
  10. Storytelling looks promising, but what are the problems of storytelling?
  11. This is an example story for the Egyptian Revolution on Storify. Storify is a storytelling service that lets the user create stories or narratives using social media and web pages. Storify was launched in September 2010, and has been open to the public since April 2011. “Storytelling” is best typified by the company Storify. http://storify.com/nzherald/mu http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=10705546
  12. The problem is that Storify operates as bookmarking; it doesn’t preserve the links. You have no clue what the person is saying about the link. http://www.nzherald.co.nz/world/news/article.cfm?c_id=2&objectid=10705523 http://storify.com/nzherald/mu
  13. So what we want to do is create persistent stories, then visualize them using a storytelling tool that users already know about, such as Storify. So we will integrate the storytelling services and the archived collections to generate archived enriched stories.
  14. So if this is the web, the archived collections are subsets of the web. We will sample from these collections to create a story, then place those generated samples in a social media interface that people already know: Storify.
  15. I went through these collections and sampled what I thought were interesting pages, ordered them by time, and put them on Storify so Yousof and his generation can see them later. It took me hours to select the resources in these handcrafted stories. Although I know the Egyptian Revolution very well, it wasn’t easy to select these pages from all the pages in the collection to represent the story.
  16. For example, here is the same page; it differs between desktop and mobile. The archives typically don’t have those versions, so currently we can’t generate this story. http://www.cnn.com/2014/02/24/world/africa/egypt-politics/index.html?hpt=wo_c2 http://america.aljazeera.com/ Personalized Web resources offer different representations based on the user-agent string and other values in the HTTP request headers, GeoIP, and other environmental factors. Currently web archives don’t support browsing different representations. This means Web crawlers capturing content for archives may receive representations based on the crawl environment which will differ from the representations returned to the interactive users.
  17. For example, here is the CNN blog as it evolved over time. You can get a sense of how the story evolved through time from looking at the images here. http://wayback.archive-it.org/2358/20110211191423/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/ http://wayback.archive-it.org/2358/*/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/
  18. For example, this is Feb. 11. We can see how CNN reported it, and we can see how BBC covered the news. Here is Feb. 11 from different news sites. This story is very important for humanities researchers. https://wayback.archive-it.org/2358/20110211074248/http://www.globalpost.com/dispatch/egypt/110210/mubarak-resign-obama-egypt https://wayback.archive-it.org/2358/20110211191445/http://www.cnn.com/ https://wayback.archive-it.org/2358/20110211192204/http://www.bbc.co.uk/news/world-middle-east-12433045 https://wayback.archive-it.org/2358/20110211192142/http://www.modernegypt.info/ https://wayback.archive-it.org/2358/20110211191423/http://news.blogs.cnn.com/category/world/egypt-world-latest-news/ https://wayback.archive-it.org/2358/20110211191423/http://www.arabist.net/ https://wayback.archive-it.org/2358/20110211194239/http://www.globalpost.com/dispatch/egypt/110211/mubarak-quits-resigns-egypt-cairo
  19. And here I want to get the broadest coverage possible for the Egyptian Revolution, sampling from the entire collection. For example, we can see here the news from CNN about shutting down the internet, and also the news from BBC about Mubarak resigning on Feb. 11.
  20. To generate these stories, we introduced the Dark and Stormy Archives framework. The framework has three main components. First, establishing a baseline of human-generated stories and Archive-It collections.
  21. First, we check the human-generated stories using stories from Storify, to have a descriptive model of what good stories look like.
  22. So we quantified the number of resources in the stories.
  23. Twitter is the most popular domain in Storify stories, and you can notice here that Twitter dominates the top list with a large percentage.
  24. We looked at what makes a good story.
  25. We looked at five things, and we found two things. Popular stories tend to have: more web elements (medians of 28 vs. 21); a longer timespan (5 hours vs. 2 hours) than the unpopular stories; longer editing time intervals than the unpopular stories.
  26. It shows that the resources of the popular stories tend to stay longer than the resources of the unpopular ones.
  27. This is the most important thing that you need to remember. This will be used as a template for our automatically generated stories.
  28. What we archive and what we put in our stories are different subsets of the web
  29. Archive-It provides their partners with tools that allow them to build themed collections of archived Web pages hosted on Archive-It's machines. This is done by the user manually specifying a set of seeds, Uniform Resource Identifiers (URIs) that should be crawled periodically (the frequency is tunable by the user), and to what depth (e.g., follow the pages linked to from the seeds two levels out).
  30. Archive-It tells curators about HTTP events, like how many HTML files, PDFs, HTTP responses, file types, and so on. The tools are currently focused on issues such as the mechanics of HTTP (e.g., how many HTML files vs. PDFs, how many HTTP 404 responses) and domain information (e.g., how many .uk sites vs. .com sites). Currently, there are no content-based tools that allow curators to detect when seed URIs go off-topic.
  31. Here is a dude running for office. The page starts as on-topic, went off-topic because of a DB error, then went off because of financial problems. Then it went on-topic again, then it was hacked, then the domain expired and was put up for sale. We don’t necessarily want to get rid of this because it documents what happened. But if we are going to choose pages for a story, a DB error page won’t be a good candidate.
  32. These are the scores of two similar pages. The textual content: cosine similarity, intersection of the most frequent terms, Jaccard coefficient. The semantics of the text: Web-based kernel function using the search engine (SE). Structural methods: the change in number of words, the change in content length.
  33. And these are the scores for two mementos in which one is off-topic because the domain is lost. The textual content: cosine similarity, intersection of the most frequent terms, Jaccard coefficient. The semantics of the text: Web-based kernel function using the search engine (SE). Structural methods: the change in number of words, the change in content length.
  34. These two mementos are about Egypt but, term-wise, they don’t overlap.
  35. For three collections, we went through and manually determined whether the mementos are off-topic or on-topic relative to their URI-Rs, like the page I showed before to be off-topic. I would never want to do this again.
  36. Cosine similarity at threshold = 0.15 is the best single method: if the cosine similarity between the candidate memento and the first memento is < 0.15, then the candidate memento is marked as 'off-topic'. For the best combined method, (Cosine, WordCount): if the cosine similarity between the candidate memento and the first memento is < 0.10 OR the word count between the candidate memento and the first memento has decreased by more than 85%, then the candidate memento is marked as 'off-topic'.
  37. FP: classified as off-topic, but really on-topic. There are some collections where we didn’t find off-topic URIs, and we have other collections that have a big chunk of off-topic mementos for lots of seed URIs. For some collections, around 10-15% of the mementos are off-topic.
  38. To gain insight about how to slice the collection, we visualized the memento-datetimes (the crawl times of the URIs) of many collections.
  39. Earlier we talked about two dimensions for stories. In this collection, here are the two dimensions: the x-axis has time, and the y-axis has URIs. For pretty much each URI, the crawling covers the same amount of time.
  40. Textual methods will combine these two pages in the same cluster, so we needed an automated method to pick between them. For the same news from two different sites, we pick the better memento in terms of quality.
  41. https://wayback.archive-it.org/2358/20110207193404/http://news.blogs.cnn.com/2011/02/07/egypt-crisis-country-to-auction-treasury-bills/ https://wayback.archive-it.org/2358/20110207194425/http://www.cnn.com/2011/WORLD/africa/02/07/egypt.google.executive/index.html?hpt=T1
  42. Remember this: this is not a good snippet. These are not good snippets. We can do better than this: we override the favicon and add the date to the title.
  43. Remember this: this is not a good snippet. These are not good snippets. We can do better than this: we override the favicon and add the date to the title.
  44. So we generate a JSON object with the metadata of the stories and push it to Storify using the Storify API. After we select the best set of pages that represent a story, we extract the metadata of the mementos and put it into JSON format for visualization. We have done a lot of work to override the default metadata that Storify extracts.
  45. We consider a story to be “good” if a person considers it to be indistinguishable from a human-generated story. Furthermore, the human and the automatic stories should be better than the random stories. A successful evaluation should have: human and DSA stories indistinguishable, and human and DSA stories better than random.
  46. They are familiar with the collections. I gave them these guidelines.
  47. We obtained 23 stories for 10 collections. Some collections do not have stories because
  48. Random stories. When you choose randomly you get what you get; sometimes it is good and sometimes it is bad.
  49. The turkers are presented with two sets of comparisons. Each comparison has two stories, and we ask them which story better summarizes the topic. They can scroll down through the stories and click on any memento.
  50. So what we want to do is create persistent stories, then visualize them using a storytelling tool that users already know about, such as Storify. So we will integrate the storytelling services and the archived collections to generate archived enriched stories.