Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media

@acnwala • JCDL 2018 • June 3, 2018

Alexander C. Nwala
Supervisor: Michael L. Nelson and Co-supervisor: Michele C. Weigle
Old Dominion University
Web Science & Digital Libraries Research Group
@acnwala • @WebSciDL
Joint Conference on Digital Libraries (JCDL) Doctoral Consortium
June 3, 2018, Fort Worth, TX
This work was made possible in part by IMLS LG-71-15-0077-15
Thank you SIGIR for the Travel Grant

Outline
1. Introduction and Motivation
2. Research questions
3. Preliminary work
a. Bootstrapping Web Archive Collections from Social Media
b. Collection Characterization
c. Summary of completed work
4. Proposed work
5. Conclusions

In March 2014, there was a serious outbreak of Ebola in West Africa
The outbreak severely affected Guinea, Liberia, and Sierra Leone, with about 11,000 deaths [1].
[1] https://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/
http://wayback.archive-it.org/4887/20141028153039/http://blogs.msf.org/en/staff/blogs/msf-ebola-blog

http://wayback.archive-it.org/4887/20141022093244/http://blog.usaid.gov/ebola/
A few months after the Ebola outbreak, an Archivist at the National Library of
Medicine (NLM) collected seeds on Archive-It for the Ebola virus outbreak.

Archive-It Ebola virus seeds
http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/
http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf
http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html
http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html
http://www.cdc.gov/vhf/ebola/french/2014-west-africa/index.html
http://www.cdc.gov/vhf/ebola/exposure/monitoring-and-movement-of-persons-with-exposure.html
http://www.cdc.gov/mmwr/ebola_reports.html
http://www.cdc.gov/media/DPK/2014/dpk-ebola-outbreak.html
http://www.acf.hhs.gov/programs/ohsepr/resource/ebola-planning-considerations
http://healthycanadians.gc.ca/diseases-conditions-maladies-affections/disease-maladie/ebola/index-eng.php
http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/whats-new.html
http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/index.html
http://nypost.com/2014/10/29/ebola-doctor-lied-about-his-nyc-travels-police/
http://www.npr.org/blogs/goatsandsoda/2014/08/20/341869218/if-salt-n-pepa-told-you-to-brush-your-teeth-youd-surely-listen/
http://www.doctorswithoutborders.org/article/msf-protocols-staff-returning-ebola-affected-countries/
http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/
http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/
http://www.philly.com/philly/blogs/public_health/Yellow-fever-and-Ebola-similar-scourges-centuries-apart.html
http://www.philly.com/philly/blogs/public_health/Syracuse-University-can-teach-us-a-lot-about-Ebola-panic.html
http://www.philly.com/philly/blogs/public_health/Why-not-ban-travel-to-stop-Ebola.html
http://www.pressherald.com/2014/10/17/fearing-ebola-strong-elementary-teacher-on-leave-after-traveling-to-dallas/
http://www.politico.com/magazine/story/2014/10/how-the-media-stoked-ebola-panic-112095.html
http://www.washingtonpost.com/news/post-nation/wp/2014/10/27/nurse-detained-under-new-jerseys-ebola-quarantine-to-be-released/?hpid=z1
http://blogs.scientificamerican.com/doing-good-science/2014/10/31/ebola-abundant-caution-and-sharing-a-world/
http://www.scientificamerican.com/article/ebola-exacerbates-west-africa-s-poverty-crisis/?WT.mc_id=SA_WR_20141105
http://www.scientificamerican.com/article/let-s-talk-about-ebola-survivors-and-sex/?WT.mc_id=SA_WR_20141105
http://www.who.int/mediacentre/news/statements/2014/ebola-20140808/en/
http://federalsoup.com/articles/2014/10/31/army-to-set-up-ebola-testing-labs-in-liberia.aspx
http://blogs.plos.org/speakingofmedicine/2014/10/22/ebola-taught-us-crucial-lesson-views-irrational-health-behaviors/
http://blogs.plos.org/speakingofmedicine/2014/10/31/social-pathways-ebola-virus-disease-rural-sierra-leone-implications-containment/
http://blogs.plos.org/speakingofmedicine/2014/10/31/rapid-response-ebola/
Sample of Archive-It Ebola virus seeds

Archived web collections begin with seeds
● A seed list is an initial collection of exemplar web pages for a topic
○ seeds + linked pages form a collection when crawled
● Archived web collections consist of groups of web pages that share a common topic, e.g., “Ebola virus” and “2018 Winter Olympics.”
● Human-generated seeds are high-quality but expensive to generate

Archived web collections offer a way of preserving the
historic record of important events
http://xhosaculture.co.za/
Mandela’s legacy
https://www.wsj.com/
2016 Dakota Access Pipeline
http://www.nj.com/
2018 Winter Olympics

Seeds may be generated by multiple users
● The Internet Archive and Archive-It (a service of the Internet Archive) have on multiple occasions requested that users submit seeds via Google Docs for:

Sample seeds contributed for the Boston Marathon Bombing Collection

Users on social media share stories that include hand-selected URIs
● The Wikipedia page about the Stoneman Douglas High School shooting was created the same day as the shooting event (February 14, 2018)
● We consider Wikipedia references an example of a micro-collection
We propose extracting URIs from micro-collections such as Wikipedia references to generate seeds

More micro-collections: extract URIs from Twitter Moments to generate seeds:
● The Stoneman Douglas High School shooting Twitter Moment was created the day after the event

Micro-collections often start early, before major events
Storify story published January 2014: “Protests In Kiev Turn Violent,” before the major event: Russian annexation of Crimea (started late February 2014)

Micro-collections may include prelusive events lacking in collections triggered by major events
Storify story of the Ukrainian crisis event (January 2014) highlights riots before the Russian annexation of Crimea (late February).
The Archive-It collection for the event potentially omits some of the prelusive content in the Storify micro-collection

We propose extracting URIs from social media
micro-collections to bootstrap archived collections
or augment curator-selected seeds.

http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/
http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/
https://www.cdc.gov/vhf/ebola/history/2014-2016-outbreak/index.html
https://twiter.com/ebola_response/
https://www.who.int/mediacentre/news/ebola/en/
http://allafrica.com/stories/201407310957.html
http://america.aljazeera.com/articles/2014/8/1/ebola-explainer.html
http://jid.oxfordjournals.org/content/204/suppl_3/S785_long
http://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0005804
https://www.youtube.com/watch?v=XasTcDsDfMg&feature=youtu.be
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074192
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4313106
http://apps.who.int/gho/data/view.ebola-sitrep.ebola-summary-latest
http://www.who.int/csr/disease/ebola/en/
http://www.nature.com/articles/nature10348
Sample of seed URIs for Ebola virus topic
Figure labels: Archive-It seeds; Micro-collections (Reddit SERP and comments, Wikipedia references)

Taking the time to create micro-collections indicates editorial effort, and thus presumably the quality of the seeds.
Wikipedia references for Stoneman Douglas High School Shooting


Primary research questions
● Research question 1:
○ Are seeds that are generated automatically from micro-collections in social media comparable to curator-generated seeds?
○ What quantitative method(s) can be used to compare collections?
● Research question 2:
○ If we consider curator hand-selected seeds the gold standard for collections, could this lead to the definition of what makes a collection good?
○ How do we assess the quality of collections at scale?


Generating seed URIs from social media
● We implemented a prototype system for generating seeds from the following social media sites:
○ Storify (out of service since May 16, 2018),
○ Twitter Moments,
○ Reddit, and
○ Wikipedia
● We also generated seeds from the Google SERP as a baseline for comparison with social media micro-collections, since we believe SERPs are a primary source for discovering seeds.

Social media micro-collections were similar to Archive-It seeds
Euclidean distance range between collections: 0.17 to

Generating seeds from Storify
● Storify was a social media curation service that enabled users to create stories consisting of hand-selected web resources such as:
○ URIs of news articles, images, videos, etc.
○ Seeds = URIs in Storify stories
○ Unfortunately, Storify went out of service in May 2018
○ http://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html

Generating seeds from Twitter Moments
● Twitter Moments is a service by Twitter that lets users create topical collections of tweets.
● Tweets in Twitter Moments embed
○ URIs of news articles, images, videos, etc.
○ Seeds = URIs in Twitter Moments

Generating seeds from Reddit
● Reddit is a service that allows users to post URIs for various topics.
● Reddit users rate the URIs and post comments that may also include URIs
○ Seeds = URIs in Reddit pages and comments
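As an illustration of "Seeds = URIs in Reddit pages and comments", the following is a minimal sketch of pulling URIs out of free-text comments. The helper name `extract_uris`, the regular expression, and the sample comments are assumptions made for illustration; the slides do not describe how the prototype system does this.

```python
import re

# Hypothetical pattern: an http(s) URI runs until whitespace or a closing delimiter.
URI_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")

def extract_uris(text):
    """Return the distinct URIs found in text, in first-seen order."""
    seen = []
    for match in URI_PATTERN.findall(text):
        uri = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if uri not in seen:
            seen.append(uri)
    return seen

# Made-up comment bodies that reuse URIs listed elsewhere in this deck.
comments = [
    "Good explainer: http://www.who.int/csr/disease/ebola/en/.",
    "See also (http://allafrica.com/stories/201407310957.html) "
    "and http://www.who.int/csr/disease/ebola/en/",
]
seeds = extract_uris("\n".join(comments))
```

Deduplicating while keeping first-seen order means a URI repeated across comments contributes a single seed.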

Generating seeds from Wikipedia
● Wikipedia is a service that enables multiple contributors to create documents about various topics ranging from politics to science and technology
● The references of Wikipedia documents include URIs relevant to the document topic
○ Seeds = URIs in Wikipedia references
Figure: Wikipedia references for Stoneman Douglas High School Shooting
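One possible way to turn a reference list into seeds is to collect the absolute links from its HTML. The sketch below uses only the Python standard library; the class name `ReferenceLinkParser` and the HTML fragment are hypothetical, and the fragment is not copied from a real Wikipedia page.

```python
from html.parser import HTMLParser

class ReferenceLinkParser(HTMLParser):
    """Collect absolute http(s) links from anchor tags in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.uris = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Keep external links only; skip in-page anchors such as "#cite_note-1".
        if href.startswith(("http://", "https://")) and href not in self.uris:
            self.uris.append(href)

# Illustrative markup (assumption): a reference list with two external links.
sample_html = """
<ol class="references">
  <li><a href="https://www.cdc.gov/vhf/ebola/history/2014-2016-outbreak/index.html">CDC</a></li>
  <li><a href="#cite_note-1">jump</a></li>
  <li><a href="http://www.who.int/csr/disease/ebola/en/">WHO</a></li>
</ol>
"""
parser = ReferenceLinkParser()
parser.feed(sample_html)
seeds = parser.uris
```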

Not all micro-collections yield high-quality seeds:
How do we recognize low-quality seeds at scale?
● Spam links in tweets
● Hijacked hashtag
● Can we assess authority of source? (e.g., Infowars)

Seeds generated from SERPs can be used as a baseline to compare social media micro-collections
Example of potential seeds generated from the Google SERP for query: “hurricane harvey”

The probability of finding the URI of a news story diminishes with time
● Probability of finding the URI of the same story: 0.34 - 0.44 daily, 0.01 - 0.11 weekly, and 0.01 - 0.08 monthly


Characterizing or comparing collections is a challenging task
● It requires comparing collections that may cater to different needs
● We explored foundational work in collection characterization from Library and Web Sciences
○ Defined a suite of 7 measures (Collection Characterizing Suite - CCS)
● The CCS is used to describe individual collections and compare multiple collections

CCS: NLM Ebola virus collection example
1. Distribution of topics: a ranked list of topics in a collection
a. “ebola outbreak west africa”
b. “guinea liberia sierra leone”
c. “cases ebola virus disease”
d. “public health workers”
e. “centers disease control prevention”
2. Distribution of sources (hostnames): a statistical summary of the various sources sampled in order to build the collection:
a. 18 (12.5%) web pages from blogs.plos.org,
b. 14 (9.7%) from cdc.gov, and
c. 11 (7.6%) from twitter.com
(Top 10 hosts fraction of collection: 50%)
3. Temporal distribution - Publication & Content: the dates found in a collection, for example:
“From August 2014–December 2015, the guidance was accessed online...The guidance was retired on February 19, 2016, when more than 45 days had passed since Guinea was declared free of Ebola virus transmission, because widespread human-to-human transmission was at an end” Page last updated: December 27, 2017
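The distribution of sources can be sketched as a count of hostnames across the collection's URIs. The function names `source_distribution` and `top_k_fraction` are hypothetical, and the four sample URIs are taken from lists shown earlier in this deck, so the numbers below are not the NLM collection statistics.

```python
from collections import Counter
from urllib.parse import urlparse

def source_distribution(uris):
    """Return (hostname, count, fraction) tuples, most frequent first."""
    counts = Counter(urlparse(u).hostname for u in uris)
    total = sum(counts.values())
    return [(host, n, n / total) for host, n in counts.most_common()]

def top_k_fraction(uris, k=10):
    """Fraction of the collection contributed by the k most frequent hosts."""
    return sum(frac for _, _, frac in source_distribution(uris)[:k])

uris = [
    "http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/",
    "http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/",
    "http://www.cdc.gov/mmwr/ebola_reports.html",
    "http://www.who.int/csr/disease/ebola/en/",
]
distribution = source_distribution(uris)
```

In this small sample, blogs.plos.org contributes two of the four pages, so it ranks first with a fraction of 0.5.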

Quantifying textual diversity in a collection
4. Content diversity: a value between 0 and 1 derived from the degree of self-similarity of the text content of the collection (higher self-similarity means lower diversity)
○ 0 - no diversity; duplicate documents
○ 1 - maximum diversity; documents without any common vocabulary
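The slides give the endpoints of the content diversity scale but not its formula. The sketch below is one way to obtain a score with those endpoints, assuming diversity is 1 minus the mean pairwise cosine similarity of term-frequency vectors. This is an illustrative choice, not necessarily the CCS definition, and it is not expected to reproduce the scores reported in this deck.

```python
import math
import re
from collections import Counter
from itertools import combinations

def term_vector(text):
    """Bag-of-words term frequencies over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def content_diversity(docs):
    """Assumed definition: 1 - mean pairwise cosine similarity of the documents."""
    pairs = list(combinations([term_vector(d) for d in docs], 2))
    if not pairs:
        return 0.0  # a single document has nothing to differ from
    return 1.0 - sum(cosine(a, b) for a, b in pairs) / len(pairs)

duplicates = ["Harvey Puts Houston Underwater"] * 3
disjoint = ["Harvey Puts Houston Underwater", "Mass Shooting in Las Vegas", "Roy Moore wins"]
mixed = ["Harvey Puts Houston Underwater", "Hurricane Harvey intensifies to Category 2 storm"]
```

Duplicate titles score 0, titles with no shared vocabulary score 1, and titles that share some words fall in between.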

Content diversity example (colors = collections, numbers = stories)
ID  News title                                                Collection
1   “Donald Trump Congratulates Roy Moore for Primary Win”    Roy Moore Wins
2   “Trump offers congratulations to Roy Moore”               Roy Moore Wins
3   “Roy Moore wins Alabama Senate GOP primary runoff”        Roy Moore Wins
4   “Harvey Puts Houston Underwater”                          Hurricane Harvey
5   “Hurricane Harvey intensifies to Category 2 storm”        Hurricane Harvey
6   “Harvey Puts Houston Underwater”                          Hurricane Harvey
7   “Mass Shooting in Las Vegas”                              Vegas Shooting
8   “Mass Shooting Outside Las Vegas’ Mandalay Bay”           Vegas Shooting
9   “Las Vegas shooting: What we know”                        Vegas Shooting
[Figure: diversity scores for sets of these stories. Sets shown: (1, 1, 1), (2, 2, 2), (3, 3, 3), (1, 2, 3) in each collection color, (1, 4, 7), (1, 8, 9), and (1 through 9). Scores shown: 0.39, 0.58, 0.30, 0.00, 0.00, 0.00, 1.00, 1.00, 0.75.]

There are multiple ways of measuring URL diversity
5. Source diversity - URI, Domain, Hostname, and Social media: indicates whether a collection samples a single source, a handful of sources, or many sources.
http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html
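As one of the possible ways of measuring this, the sketch below treats diversity as the ratio of distinct values to total values at the URI, hostname, and domain levels. Both the ratio and the `naive_domain` helper (which keeps the last two hostname labels instead of consulting the Public Suffix List) are simplifying assumptions, not the definitions used in the CCS.

```python
from urllib.parse import urlparse

def diversity(items):
    """Assumed measure: distinct items divided by total items."""
    items = list(items)
    return len(set(items)) / len(items) if items else 0.0

def naive_domain(hostname):
    # Simplification: keep the last two labels; this mislabels suffixes such as "co.za".
    return ".".join(hostname.split(".")[-2:])

def source_diversity(uris):
    hosts = [urlparse(u).hostname or "" for u in uris]
    return {
        "uri": diversity(uris),
        "hostname": diversity(hosts),
        "domain": diversity(naive_domain(h) for h in hosts),
    }

uris = [
    "http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/",
    "http://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0005804",
    "http://www.cdc.gov/mmwr/ebola_reports.html",
    "http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/index.html",
]
scores = source_diversity(uris)
```

Here every URI and hostname is distinct, but only two domains (plos.org and cdc.gov) are sampled, so domain diversity is lower than hostname diversity.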

CCS: Approximating popularity and target audience
6. Collection exposure - Archival rate and Tweet index rate: approximates popularity
● Archival rate: fraction of archived URIs in collection
● Tweet index rate: fraction of URIs in collection found embedded in tweets
7. Target audience: approximates target audience of a collection with readability scores
Examples (grade level - title - source):
7th - “History of hurricanes in Texas, by the numbers” - abcnews.go.com
11th - “Trump faces leadership test with Hurricane Harvey” - thehill.com
12th - “Harvey Puts Houston Underwater” - dailycaller.com
18th (graduate) - “Ebola virus entry requires the host-programmed recognition of an intracellular receptor” - nih.gov
20th (graduate) - “Virus taxonomy classification and nomenclature of viruses” - sciencemag.org
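The slides do not say which readability measure produces these grade levels. As an illustration only, the sketch below computes the Flesch-Kincaid grade level, 0.39 (words/sentences) + 11.8 (syllables/words) - 15.59, with a crude vowel-run syllable counter. The measure, the function names, and the counter are all assumptions, so the output will not match the grades listed above.

```python
import re

def count_syllables(word):
    # Crude heuristic: one syllable per run of vowels, at least one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level using the heuristic syllable counter above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

simple_grade = flesch_kincaid_grade("The cat sat on the mat. The dog ran.")
complex_grade = flesch_kincaid_grade(
    "Virus taxonomy classification and nomenclature of viruses."
)
```

Longer sentences and longer words push the grade up, which is why the taxonomy title scores well above the short everyday sentences.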

Comparing collections with CCS
● We represented each collection as an n-dimensional vector of CCS values.
● Calculated distance between vectors.

CCS metrics                               Archive-It Col.   Reddit Col.
Doc-Term content diversity                0.86              0.89
List of entity set content diversity      0.65              0.85
URI diversity                             1.00              0.98
Domain diversity                          0.34              0.50
Hostname diversity                        0.43              0.53
Social media rate                         0.07              0.12
Archival rate                             0.99              0.78
Tweet index rate                          0.72              0.40
Exposure rate (reading level)             0.61              0.61
n-gram similarity of topic distribution   1.00              0.70
Normalized Euclidean distance: 0.17
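The comparison can be sketched with the two columns of values from this slide. The normalization is not stated, so the code assumes the Euclidean distance is divided by the square root of the number of dimensions, which keeps the result in [0, 1] when every metric is in [0, 1]. Under that assumption the distance comes out at about 0.18, close to the 0.17 reported.

```python
import math

def normalized_euclidean(a, b):
    """Euclidean distance divided by sqrt(n); the normalization is an assumption."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(squared / len(a))

# CCS values in the order listed on the slide.
archive_it = [0.86, 0.65, 1.00, 0.34, 0.43, 0.07, 0.99, 0.72, 0.61, 1.00]
reddit = [0.89, 0.85, 0.98, 0.50, 0.53, 0.12, 0.78, 0.40, 0.61, 0.70]

distance = normalized_euclidean(archive_it, reddit)
```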


Summary of completed research (2016)
JCDL 2016: Identified the need for using local news sources to build collections for local events.

Summary of completed research (2017-2018)
HyperText 2018: Introduced the Collection Characterizing Suite for characterizing and comparing collections

Summary of completed research (2017-2018)
JCDL 2018: Investigated discoverability of URIs of news stories on SERPs

Outline of work that informed this research
● Studying SERPs: A Supervised Learning Algorithm for Binary Domain Classification of Web Queries using SERPs (JCDL 2016 Poster)
● Interacting with Twitter
a. Extracting tweet conversations
b. Finding URLs on Twitter
● Extracting text from news documents
● Finding Storify stories


Schedule for pending research for 2018-2019 (timeline: 2018-06 to 2019-12)
● Identify hubs & authorities in social media: 2018-06 - 2018-12
● Candidacy proposal: 2018-06 - 2018-12
● Implement seed generation system: 2019-01 - 2019-03
● Evaluate seed generation system: 2019-04 - 2019-08
● Dissertation/Defense: 2019-09 - 2019-12

Conclusions
● Archived collections offer a way of preserving the historic record of important events and begin with seeds.
● We propose exploiting micro-collections on social media to augment or bootstrap archived collections for stories and events.
○ Introduced the CCS for characterizing and comparing collections
○ We showed that micro-collections generated from social media are similar to Archive-It seeds
● Primary research tasks remaining:
○ Identify hubs and authorities in social media as a method to evaluate quality at scale
○ Investigate what makes “good” seeds and implement/evaluate the seed generation system
@acnwala @webscidl
Thank you!
