Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media
1. Bootstrapping Web Archive Collections
of Stories from Micro-collections in Social Media
@acnwala • JCDL 2018 • June 3, 2018
Bootstrapping Web Archive Collections
of Stories from Micro-collections in Social Media
2. Alexander C. Nwala
Supervisor: Michael L. Nelson and Co-supervisor: Michele C. Weigle
Old Dominion University
Web Science & Digital Libraries Research Group
@acnwala • @WebSciDL
Joint Conference on Digital Libraries (JCDL) Doctoral Consortium
June 3, 2018, Fort Worth, TX
This work was made possible in
part by IMLS LG-71-15-0077-15
Thank you SIGIR for the Travel Grant
3. Outline
1. Introduction and Motivation
2. Research questions
3. Preliminary work
a. Bootstrapping Web Archive Collections from Social
Media
b. Collection Characterization
c. Summary of completed work
4. Proposed work
5. Conclusions
4. In March 2014, there was a serious outbreak of Ebola in West Africa
The outbreak severely affected Guinea, Liberia, and Sierra Leone, with about 11,000 deaths [1].
[1] https://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/
http://wayback.archive-it.org/4887/20141028153039/http://blogs.msf.org/en/staff/blogs/msf-ebola-blog
7. ● A seed list is an initial collection of exemplar web pages for a topic
○ seeds + linked pages form a collection when crawled
● Archived web collections consist of groups of web pages that share a
common topic e.g., “Ebola virus” and “2018 Winter Olympics.”
● Human-generated seeds are high-quality, but expensive to generate
Archived web collections begin with seeds
8. Archived web collections offer a way of preserving the
historic record of important events
http://xhosaculture.co.za/
Mandela’s legacy
https://www.wsj.com/
2016 Dakota Access Pipeline
http://www.nj.com/
2018 Winter Olympics
9. ● The Internet Archive and Archive-It (a service of the Internet Archive) have
on multiple occasions requested that users submit seeds via Google Docs
for:
Seeds may be generated by multiple users
11. Users on social media share stories that include hand-selected URIs
● The Wikipedia page about the Stoneman Douglas High School shooting was created the same day as the shooting event (February 14, 2018)
● We consider Wikipedia references an example of a micro-collection
We propose extracting URIs from micro-collections such as Wikipedia references to generate seeds
12. More micro-collections: extract URIs from Twitter Moments to generate seeds:
● The Stoneman Douglas High School shooting Twitter Moment was created the day after the event
13. Micro-collections often start early, before major events
Storify story published January 2014: “Protests In Kiev Turn Violent,” before the major event: the Russian annexation of Crimea (started late February 2014)
14. Micro-collections may include prelusive events lacking in collections triggered by major events
The Archive-It collection for the event potentially omits some of the prelusive content in the Storify micro-collection
The Storify story of the Ukrainian crisis (January 2014) highlights riots before the Russian annexation of Crimea (late February).
15. We propose extracting URIs from social media
micro-collections to bootstrap archived collections
or augment curator-selected seeds.
17. The effort taken to create a micro-collection indicates editorial care, and thus presumably the quality of its seeds.
Wikipedia references for Stoneman Douglas High School Shooting
18. Outline
1. Introduction and Motivation
2. Research questions
3. Preliminary work
a. Bootstrapping Web Archive Collections from Social
Media
b. Collection Characterization
c. Summary of completed work
4. Proposed work
5. Conclusions
19. ● Research question 1:
○ Are seeds that are generated automatically from micro-collections in
social media comparable to curator-generated seeds?
○ What quantitative method(s) can be used to compare collections?
● Research question 2:
○ If we consider curator hand-selected seeds the gold standard for
collections, could this lead to the definition of what makes a collection
good?
○ How do we assess the quality of collections at scale?
Primary research questions
20. Outline
1. Introduction and Motivation
2. Research questions
3. Preliminary work
a. Bootstrapping Web Archive Collections from Social
Media
b. Collection Characterization
c. Summary of completed work
4. Proposed work
5. Conclusions
21. ● We implemented a prototype system for generating seeds from the
following social media sites:
○ Storify (out of service since May 16, 2018),
○ Twitter Moments,
○ Reddit, and
○ Wikipedia
● We also generated seeds from the Google SERP as a baseline against which to compare social media micro-collections, since we believe SERPs are a primary source for discovering seeds.
Generating seed URIs from social media
23. ● Storify was a social media curation service that enabled users to create stories consisting of hand-selected web resources such as:
○ URIs of news articles, images, videos, etc.
○ Seeds = URIs in Storify stories
○ Unfortunately, Storify went out of service in May 2018
○ http://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html
Generating seeds from Storify
24. ● Twitter Moments is a service by Twitter that lets users create topical collections of tweets.
● Tweets in Twitter Moments embed:
○ URIs of news articles, images, videos, etc.
○ Seeds = URIs in Twitter Moments
Generating seeds from Twitter Moments
25. ● Reddit is a service that allows users to post URIs for various topics.
● Reddit users rate the URIs and post comments that may also include URIs
○ Seeds = URIs in Reddit pages and comments
Generating seeds from Reddit
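A minimal sketch of pulling candidate seed URIs out of free text such as comments or tweets, assuming the text has already been retrieved. The function name `extract_uris`, the regular expression, and the sample text are hypothetical illustrations, not the prototype's implementation.

```python
import re

# Matches http(s) URIs up to the next whitespace, quote, or angle bracket.
URI_PATTERN = re.compile(r"https?://[^\s<>\"']+")

# Punctuation that often trails a URI in running text. Stripping ")" is a
# simplification: it would also trim URIs that legitimately end in ")".
TRAILING_PUNCTUATION = ".,;:!?)]}"


def extract_uris(text):
    """Return URIs found in text, de-duplicated in first-seen order."""
    uris = []
    for match in URI_PATTERN.findall(text):
        uri = match.rstrip(TRAILING_PUNCTUATION)
        if uri and uri not in uris:
            uris.append(uri)
    return uris


# Hypothetical comment text, for illustration only.
SAMPLE_COMMENT = (
    "Live updates: https://www.example.com/live. "
    "Also see (http://example.org/a?b=1), "
    "and again https://www.example.com/live"
)

if __name__ == "__main__":
    for uri in extract_uris(SAMPLE_COMMENT):
        print(uri)
```

De-duplicating while preserving order keeps the earliest mention of each URI, which may be useful if position in a thread is later treated as a signal.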
26. ● Wikipedia is a service that enables multiple contributors to create documents about various topics ranging from politics to science and technology
● The references of Wikipedia documents include URIs relevant to the document topic
○ Seeds = URIs in Wikipedia references
Generating seeds from Wikipedia
Wikipedia references for
Stoneman Douglas High
School Shooting
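As a rough sketch of the idea, assuming the references section of an article is already available as an HTML string, external links can be collected with Python's standard `html.parser`. The class `ReferenceLinkParser`, the function `extract_seeds`, and the sample HTML are hypothetical and are not taken from the prototype described in these slides.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ReferenceLinkParser(HTMLParser):
    """Collects absolute http(s) links found in <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                if urlparse(value).scheme in ("http", "https"):
                    self.links.append(value)


def extract_seeds(html, exclude_hosts=("en.wikipedia.org",)):
    """Return de-duplicated candidate seed URIs in first-seen order."""
    parser = ReferenceLinkParser()
    parser.feed(html)
    seeds = []
    seen = set()
    for uri in parser.links:
        host = urlparse(uri).netloc.lower()
        if host in exclude_hosts or uri in seen:
            continue
        seen.add(uri)
        seeds.append(uri)
    return seeds


# Hypothetical fragment standing in for a references section.
SAMPLE_HTML = """
<ol class="references">
  <li><a href="https://www.example.com/news/story-1">Story 1</a></li>
  <li><a href="https://www.example.org/report">Report</a></li>
  <li><a href="https://www.example.com/news/story-1">Story 1 (repeat)</a></li>
  <li><a href="/wiki/Internal_page">Internal link</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Another_page">Wiki link</a></li>
</ol>
"""

if __name__ == "__main__":
    for seed in extract_seeds(SAMPLE_HTML):
        print(seed)
```

Relative links and links back to the encyclopedia itself are skipped here on the assumption that only external sources are useful as seeds.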
27. Not all micro-collections yield high-quality seeds:
How do we recognize low-quality seeds at scale?
Spam links in tweets
Hijacked hashtag
Can we assess authority of source?
Infowars
28. Example of potential seeds generated from the Google SERP for query: “hurricane harvey”
Seeds generated from SERPs can be used as a baseline against which to compare social media micro-collections
29. The probability of finding the URI of a news story diminishes with time
● Probability of finding the URI of the same story: daily 0.34–0.44, weekly 0.01–0.11, and monthly 0.01–0.08
30. Outline
1. Introduction and Motivation
2. Research questions
3. Preliminary work
a. Bootstrapping Web Archive Collections from Social
Media
b. Collection Characterization
c. Summary of completed work
4. Proposed work
5. Conclusions
31. Characterizing or comparing collections is a challenging task
● It requires comparing collections that may cater to different needs
● We explored foundational work in collection characterization from Library and Web Sciences
○ Defined a suite of 7 measures (Collection Characterizing Suite, CCS)
● The CCS is used to describe individual collections and compare multiple collections
32. CCS: NLM Ebola virus collection example
1. Distribution of topics: a ranked list of topics in a collection
a. “ebola outbreak west africa”
b. “guinea liberia sierra leone”
c. “cases ebola virus disease”
d. “public health workers”
e. “centers disease control prevention”
2. Distribution of sources (hostnames): a statistical summary of the various sources sampled in order to build the collection:
a. 18 (12.5%) web pages from blogs.plos.org,
b. 14 (9.7%) from cdc.gov, and
c. 11 (7.6%) from twitter.com
(Top 10 hosts' fraction of collection: 50%)
3. Temporal distribution - Publication & Content: the set of dates found in a collection, for example:
“From August 2014–December 2015, the guidance was accessed online...The guidance was retired on February 19, 2016, when more than 45 days had passed since Guinea was declared free of Ebola virus transmission, because widespread human-to-human transmission was at an end” (Page last updated: December 27, 2017)
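The distribution of sources (hostnames) in this example can be sketched as a simple frequency count over the hostnames of a collection's URIs. The function name `source_distribution` and the sample URIs are hypothetical; the slides do not show how the summary was computed.

```python
from collections import Counter
from urllib.parse import urlparse


def source_distribution(uris, top_n=10):
    """Return (rows, top_fraction).

    rows is a list of (hostname, count, percent) for the top_n hosts;
    top_fraction is the share of the collection those hosts account for.
    """
    hosts = [urlparse(u).netloc.lower() for u in uris]
    counts = Counter(hosts)
    total = len(hosts)
    rows = [
        (host, count, round(100.0 * count / total, 1))
        for host, count in counts.most_common(top_n)
    ]
    top_fraction = sum(count for _, count, _ in rows) / total if total else 0.0
    return rows, top_fraction


# Hypothetical collection, for illustration only.
SAMPLE_URIS = [
    "https://a.example/1",
    "https://a.example/2",
    "https://a.example/3",
    "https://b.example/1",
]

if __name__ == "__main__":
    rows, top_fraction = source_distribution(SAMPLE_URIS)
    for host, count, percent in rows:
        print(f"{count} ({percent}%) web pages from {host}")
    print(f"Top hosts fraction of collection: {top_fraction:.0%}")
```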
33.
4. Content diversity: a value between 0 and 1 indicating how dissimilar the text content of the documents in the collection is (the inverse of self-similarity)
○ 0 - no diversity; duplicate documents
○ 1 - maximum diversity; documents without any common vocabulary
Quantifying textual diversity in a collection
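One way to obtain a score with these two endpoints, assumed here purely for illustration, is one minus the average pairwise cosine similarity of bag-of-words vectors. The slides do not give the formula used in the CCS, so this sketch should not be expected to reproduce its scores; all function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Lowercased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def content_diversity(documents):
    """1 - mean pairwise cosine similarity, clamped to [0, 1]."""
    vectors = [term_vector(d) for d in documents]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # Fewer than two documents: nothing to compare.
    mean_sim = sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
    return min(1.0, max(0.0, 1.0 - mean_sim))


if __name__ == "__main__":
    duplicates = ["harvey puts houston underwater"] * 3
    disjoint = ["roy moore wins", "harvey houston flood", "las vegas shooting"]
    print(content_diversity(duplicates))  # close to 0: duplicate documents
    print(content_diversity(disjoint))    # 1.0: no shared vocabulary
```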
34. Content diversity example (colors = collections, numbers = stories)
ID  News Title                                              Collection
1   “Donald Trump Congratulates Roy Moore for Primary Win”  Roy Moore Wins
2   “Trump offers congratulations to Roy Moore”             Roy Moore Wins
3   “Roy Moore wins Alabama Senate GOP primary runoff”      Roy Moore Wins
4   “Harvey Puts Houston Underwater”                        Hurricane Harvey
5   “Hurricane Harvey intensifies to Category 2 storm”      Hurricane Harvey
6   “Harvey Puts Houston Underwater”                        Hurricane Harvey
7   “Mass Shooting in Las Vegas”                            Vegas Shooting
8   “Mass Shooting Outside Las Vegas’ Mandalay Bay”         Vegas Shooting
9   “Las Vegas shooting: What we know”                      Vegas Shooting
[Figure: diversity scores for example story sets. Story sets shown: (1, 1, 1), (2, 2, 2), (3, 3, 3), three sets of (1, 2, 3), (1, 4, 7), (1, 8, 9), and (1 through 9). Scores shown: 0.39, 0.58, 0.30, 0.00, 0.00, 0.00, 1.00, 1.00, 0.75.]
35.
5. Source diversity - URI, Domain, Hostname, and Social media: indicates whether a
collection samples a single source, a handful of sources, or many sources.
There are multiple ways of measuring URL diversity
http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html
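One simple possibility, assumed here for illustration, is to rescale the number of distinct sources so that a single source scores 0 and all-distinct sources score 1, that is, (U - 1) / (N - 1) for U unique values among N URIs. This formula and the function names are assumptions, and the linked post should be consulted for the definitions actually used. Domain-level diversity is omitted because it needs a public-suffix list to separate domains from hostnames.

```python
from urllib.parse import urlparse


def diversity(items):
    """(unique - 1) / (total - 1): 0 when all items match, 1 when all differ."""
    items = list(items)
    if len(items) < 2:
        return 0.0
    return (len(set(items)) - 1) / (len(items) - 1)


def uri_diversity(uris):
    """Diversity over full URIs."""
    return diversity(uris)


def hostname_diversity(uris):
    """Diversity over the hostnames of the URIs."""
    return diversity(urlparse(u).netloc.lower() for u in uris)


# Hypothetical collection: four distinct URIs spread over two hostnames.
SAMPLE_COLLECTION = [
    "https://news.example.com/a",
    "https://news.example.com/b",
    "https://news.example.com/c",
    "https://blog.example.org/d",
]

if __name__ == "__main__":
    print(uri_diversity(SAMPLE_COLLECTION))       # all URIs are distinct
    print(hostname_diversity(SAMPLE_COLLECTION))  # two hosts among four URIs
```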
36. CCS: Approximating popularity and target audience
6. Collection exposure - Archival rate and Tweet index rate: approximates popularity
● Archival rate: fraction of archived URIs in collection
● Tweet index rate: fraction of URIs in collection found embedded in tweets
7. Target audience: approximates target audience of a collection with readability scores
grade level - title - source
7th - “History of hurricanes in Texas, by the numbers” - abcnews.go.com
11th - “Trump faces leadership test with Hurricane Harvey” - thehill.com
12th - “Harvey Puts Houston Underwater” - dailycaller.com
18th (graduate) - “Ebola virus entry requires the host-programmed recognition of an intracellular receptor” - nih.gov
20th (graduate) - “Virus taxonomy classification and nomenclature of viruses” - sciencemag.org
37. Comparing collections with CCS
● We represented each collection as an n-dimensional vector of CCS values.
● Calculated the distance between vectors.
CCS metric                                  Archive-It Col.  Reddit Col.
Doc-Term content diversity                  0.86             0.89
List of entity set content diversity        0.65             0.85
URI diversity                               1.00             0.98
Domain diversity                            0.34             0.50
Hostname diversity                          0.43             0.53
Social media rate                           0.07             0.12
Archival rate                               0.99             0.78
Tweet index rate                            0.72             0.40
Exposure rate (reading level)               0.61             0.61
n-gram similarity of topic distribution     1.00             0.70
Normalized Euclidean distance: 0.17
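The comparison in the table above can be sketched as a distance between two vectors of CCS values. The normalization used here, dividing the summed squared differences by the number of dimensions before taking the square root, is an assumption; on the rounded table values it gives roughly 0.18, close to the 0.17 on the slide, and the small gap is presumably due to rounding or a different normalization.

```python
import math

# CCS values copied from the table above (same row order for both vectors).
ARCHIVE_IT = [0.86, 0.65, 1.00, 0.34, 0.43, 0.07, 0.99, 0.72, 0.61, 1.00]
REDDIT = [0.89, 0.85, 0.98, 0.50, 0.53, 0.12, 0.78, 0.40, 0.61, 0.70]


def normalized_euclidean(u, v):
    """Euclidean distance scaled by sqrt(n), so values in [0, 1] stay in [0, 1]."""
    if len(u) != len(v) or not u:
        raise ValueError("vectors must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.sqrt(squared / len(u))


if __name__ == "__main__":
    print(round(normalized_euclidean(ARCHIVE_IT, REDDIT), 2))
```

Scaling by the number of dimensions keeps the distance comparable if metrics are later added to or removed from the suite.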
38. Outline
1. Introduction and Motivation
2. Research questions
3. Preliminary work
a. Bootstrapping Web Archive Collections from Social
Media
b. Collection Characterization
c. Summary of completed work
4. Proposed work
5. Conclusions
39. JCDL 2016: Highlighted the need to use local news sources to build collections for local events.
Summary of completed research (2016)
40. HyperText 2018: Introduced the Collection Characterizing Suite for characterizing and comparing collections
Summary of completed research (2017-2018)
41. JCDL 2018: Investigated discoverability of URIs of news stories on SERPs
Summary of completed research (2017-2018)
42. ● Studying SERPs: A Supervised Learning Algorithm for Binary Domain Classification of Web Queries using SERPs (JCDL 2016 Poster)
● Interacting with Twitter
a. Extracting tweet conversations
b. Finding URLs on Twitter
● Extracting text from news documents
● Finding Storify stories
Outline of work that informed this research
43. Outline
1. Introduction and Motivation
2. Research questions
3. Preliminary work
a. Bootstrapping Web Archive Collections from Social
Media
b. Collection Characterization
c. Summary of completed work
4. Proposed work
5. Conclusions
44. Schedule for pending research for 2018-2019 (timeline: 2018-06 to 2019-12)
● 2018-06 - 2018-12: Identify hubs & authorities in social media
● 2018-06 - 2018-12: Candidacy proposal
● 2019-01 - 2019-03: Implement seed generation system
● 2019-04 - 2019-08: Evaluate seed generation system
● 2019-09 - 2019-12: Dissertation/Defense
45. Conclusions
● Archived collections offer a way of preserving the historic record of important events and begin with seeds.
● We propose exploiting micro-collections on social media to augment or bootstrap archived collections for stories and events.
○ Introduced the CCS for characterizing and comparing collections
○ We showed that micro-collections generated from social media are similar to Archive-It seeds
● Primary research tasks remaining:
○ Identify hubs and authorities in social media as a method to evaluate quality at scale
○ Investigate what makes “good” seeds and implement/evaluate the seed generation system
@acnwala @webscidl
Thank you!