Bootstrapping Web Archive Collections
of Stories from Micro-collections in Social Media
@acnwala • JCDL 2018 • June 3, 2018
Alexander C. Nwala
Supervisor: Michael L. Nelson and Co-supervisor: Michele C. Weigle
Old Dominion University
Web Science & Digital Libraries Research Group
@acnwala • @WebSciDL
Joint Conference on Digital Libraries (JCDL) Doctoral Consortium
June 3, 2018, Fort Worth, TX
This work was made possible in
part by IMLS LG-71-15-0077-15
Thank you SIGIR for the Travel Grant
Outline
1. Introduction and Motivation
2. Research questions
3. Preliminary work
a. Bootstrapping Web Archive Collections from Social Media
b. Collection Characterization
c. Summary of completed work
4. Proposed work
5. Conclusions
In March 2014, there was a serious outbreak of Ebola in West Africa
The outbreak severely affected Guinea, Liberia, and Sierra Leone with about 11,000 deaths¹.
¹ https://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/
http://wayback.archive-it.org/4887/20141028153039/http://blogs.msf.org/en/staff/blogs/msf-ebola-blog
http://wayback.archive-it.org/4887/20141022093244/http://blog.usaid.gov/ebola/
A few months after the Ebola outbreak, an Archivist at the National Library of
Medicine (NLM) collected seeds on Archive-It for the Ebola virus outbreak.
Archive-It Ebola virus seeds
http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/
http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf
http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html
http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html
http://www.cdc.gov/vhf/ebola/french/2014-west-africa/index.html
http://www.cdc.gov/vhf/ebola/exposure/monitoring-and-movement-of-persons-with-exposure.html
http://www.cdc.gov/mmwr/ebola_reports.html
http://www.cdc.gov/media/DPK/2014/dpk-ebola-outbreak.html
http://www.acf.hhs.gov/programs/ohsepr/resource/ebola-planning-considerations
http://healthycanadians.gc.ca/diseases-conditions-maladies-affections/disease-maladie/ebola/index-eng.php
http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/whats-new.html
http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/index.html
http://nypost.com/2014/10/29/ebola-doctor-lied-about-his-nyc-travels-police/
http://www.npr.org/blogs/goatsandsoda/2014/08/20/341869218/if-salt-n-pepa-told-you-to-brush-your-teeth-youd-surely-listen/
http://www.doctorswithoutborders.org/article/msf-protocols-staff-returning-ebola-affected-countries/
http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/
http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/
http://www.philly.com/philly/blogs/public_health/Yellow-fever-and-Ebola-similar-scourges-centuries-apart.html
http://www.philly.com/philly/blogs/public_health/Syracuse-University-can-teach-us-a-lot-about-Ebola-panic.html
http://www.philly.com/philly/blogs/public_health/Why-not-ban-travel-to-stop-Ebola.html
http://www.pressherald.com/2014/10/17/fearing-ebola-strong-elementary-teacher-on-leave-after-traveling-to-dallas/
http://www.politico.com/magazine/story/2014/10/how-the-media-stoked-ebola-panic-112095.html
http://www.washingtonpost.com/news/post-nation/wp/2014/10/27/nurse-detained-under-new-jerseys-ebola-quarantine-to-be-released/?hpid=z1
http://blogs.scientificamerican.com/doing-good-science/2014/10/31/ebola-abundant-caution-and-sharing-a-world/
http://www.scientificamerican.com/article/ebola-exacerbates-west-africa-s-poverty-crisis/?WT.mc_id=SA_WR_20141105
http://www.scientificamerican.com/article/let-s-talk-about-ebola-survivors-and-sex/?WT.mc_id=SA_WR_20141105
http://www.who.int/mediacentre/news/statements/2014/ebola-20140808/en/
http://federalsoup.com/articles/2014/10/31/army-to-set-up-ebola-testing-labs-in-liberia.aspx
http://blogs.plos.org/speakingofmedicine/2014/10/22/ebola-taught-us-crucial-lesson-views-irrational-health-behaviors/
http://blogs.plos.org/speakingofmedicine/2014/10/31/social-pathways-ebola-virus-disease-rural-sierra-leone-implications-containment/
http://blogs.plos.org/speakingofmedicine/2014/10/31/rapid-response-ebola/
Sample of Archive-It
Ebola virus seeds
Archived web collections begin with seeds
● A seed list is an initial collection of exemplar web pages for a topic
○ seeds + linked pages form a collection when crawled
● Archived web collections consist of groups of web pages that share a common topic, e.g., “Ebola virus” and “2018 Winter Olympics.”
● Human-generated seeds are high-quality, but expensive to generate
Archived web collections offer a way of preserving the
historic record of important events
Mandela’s legacy: http://xhosaculture.co.za/
2016 Dakota Access Pipeline: https://www.wsj.com/
2018 Winter Olympics: http://www.nj.com/
Seeds may be generated by multiple users
● The Internet Archive and Archive-It (a service of the Internet Archive) have on multiple occasions requested that users submit seeds via Google Docs.
Sample seeds contributed for the Boston Marathon Bombing Collection
Users on social media share stories that include hand-selected URIs
● The Wikipedia page about the Stoneman Douglas High School shooting was created the same day as the shooting event (February 14, 2018)
● We consider Wikipedia references an example of a micro-collection
We propose extracting URIs from micro-collections such as Wikipedia references to generate seeds
More micro-collections: extract URIs from Twitter Moments to generate seeds:
● Stoneman Douglas High School shooting Twitter Moment created the day after the event
Micro-collections often start early before major events
Storify story published Jan 2014: “Protests In Kiev Turn Violent,” before the major event: Russian annexation of Crimea (started late February 2014)
Micro-collections may include prelusive events lacking in collections triggered by major events
Storify story of the Ukrainian crisis event (January 2014) highlights riots before Russian annexation of Crimea (late February).
Archive-It collection for the event potentially omits some of the prelusive contents in the Storify micro-collection
We propose extracting URIs from social media micro-collections to bootstrap archived collections or augment curator-selected seeds.
http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/
http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/
https://www.cdc.gov/vhf/ebola/history/2014-2016-outbreak/index.html
https://twiter.com/ebola_response/
https://www.who.int/mediacentre/news/ebola/en/
http://allafrica.com/stories/201407310957.html
http://america.aljazeera.com/articles/2014/8/1/ebola-explainer.html
http://jid.oxfordjournals.org/content/204/suppl_3/S785_long
http://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0005804
https://www.youtube.com/watch?v=XasTcDsDfMg&feature=youtu.be
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074192
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4313106
http://apps.who.int/gho/data/view.ebola-sitrep.ebola-summary-latest
http://www.who.int/csr/disease/ebola/en/
http://www.nature.com/articles/nature10348
Sample of seed URIs for Ebola virus topic
Sources shown: Archive-It seeds and micro-collections (Reddit SERP and comments, Wikipedia references)
The effort taken to create micro-collections indicates editorial effort, and thus presumably the quality of the seeds.
Wikipedia references for Stoneman Douglas High School Shooting
2. Research questions
Primary research questions
● Research question 1:
○ Are seeds that are generated automatically from micro-collections in social media comparable to curator-generated seeds?
○ What quantitative method(s) can be used to compare collections?
● Research question 2:
○ If we consider curator hand-selected seeds the gold standard for collections, could this lead to the definition of what makes a collection good?
○ How do we assess the quality of collections at scale?
3. Preliminary work
a. Bootstrapping Web Archive Collections from Social Media
Generating seed URIs from social media
● We implemented a prototype system for generating seeds from the following social media sites:
○ Storify (out of service since May 16, 2018),
○ Twitter Moments,
○ Reddit, and
○ Wikipedia
● We also generated seeds from the Google SERP as a baseline for comparison with social media micro-collections, since we believe SERPs are a primary source for discovering seeds.
Social media micro-collections were similar to Archive-It seeds
Euclidean distance range between collections: 0.17 to
Generating seeds from Storify
● Storify was a social media curation service that enabled users to create stories consisting of hand-selected web resources such as:
○ URIs of news articles, images, videos, etc.
○ Seeds = URIs in Storify stories
○ Unfortunately, Storify went out of service in May 2018
○ http://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html
Generating seeds from Twitter Moments
● Twitter Moments is a service by Twitter that lets users create topical collections of tweets.
● Tweets in Twitter Moments embed
○ URIs of news articles, images, videos, etc.
○ Seeds = URIs in Twitter Moments
Generating seeds from Reddit
● Reddit is a service that allows users to post URIs for various topics.
● Reddit users rate the URIs and post comments that may also include URIs
○ Seeds = URIs in Reddit pages and comments
Generating seeds from Wikipedia
● Wikipedia is a service that enables multiple contributors to create documents about various topics ranging from politics to science and technology
● The references of Wikipedia documents include URIs relevant to the document topic
○ Seeds = URIs in Wikipedia references
Wikipedia references for Stoneman Douglas High School Shooting
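The "Seeds = URIs in ..." rule used for each micro-collection can be illustrated with a small sketch. This is not the prototype system described in the talk. It assumes the page (for example, a Wikipedia reference list) has already been fetched as HTML, and the names `ExternalLinkParser`, `extract_seeds`, and `SAMPLE_REFERENCES_HTML` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalLinkParser(HTMLParser):
    """Collects href values of <a> tags that are absolute http(s) URIs."""

    def __init__(self):
        super().__init__()
        self.uris = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and urlparse(href).scheme in ("http", "https"):
            self.uris.append(href)


def extract_seeds(html, exclude_hosts=()):
    """Return de-duplicated external URIs in document order.

    exclude_hosts is an assumed convenience for dropping links that point
    back to the hosting site itself (e.g. "wikipedia.org").
    """
    parser = ExternalLinkParser()
    parser.feed(html)
    seeds, seen = [], set()
    for uri in parser.uris:
        host = urlparse(uri).hostname or ""
        if uri in seen or any(host.endswith(h) for h in exclude_hosts):
            continue
        seen.add(uri)
        seeds.append(uri)
    return seeds


# Hypothetical fragment standing in for a fetched reference list.
SAMPLE_REFERENCES_HTML = """
<ol class="references">
  <li><a href="https://www.cdc.gov/vhf/ebola/history/2014-2016-outbreak/index.html">CDC</a></li>
  <li><a href="http://www.who.int/csr/disease/ebola/en/">WHO</a></li>
  <li><a href="/wiki/Ebola">internal link</a></li>
  <li><a href="http://www.who.int/csr/disease/ebola/en/">WHO again</a></li>
</ol>
"""

seeds = extract_seeds(SAMPLE_REFERENCES_HTML, exclude_hosts=("wikipedia.org",))
```

Relative links and repeated URIs are dropped, so `seeds` holds the two external URIs. The same function could be pointed at other micro-collection pages, although tweets and Reddit threads would typically need their own retrieval step first.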
Not all micro-collections yield high-quality seeds: how do we recognize low-quality seeds at scale?
● Spam links in tweets
● Hijacked hashtag
● Can we assess authority of source? (Infowars)
Example of potential seeds generated from the Google SERP for query: “hurricane harvey”
Seeds generated from SERPs can be used as a baseline for comparison with social media micro-collections
The probability of finding the URI of a news story diminishes with time
● Probability of finding the URI of the same story: daily 0.34 - 0.44, weekly 0.01 - 0.11, and monthly 0.01 - 0.08
b. Collection Characterization
Characterizing or comparing collections is a challenging task
● It requires comparing collections that may cater to different needs
● We explored foundational work in collection characterization from Library and Web Sciences
○ Defined a suite of 7 measures (Collection Characterizing Suite - CCS)
● The CCS is used to describe individual collections and compare multiple collections
CCS: NLM Ebola virus collection example
1. Distribution of topics: a ranked list of topics in a collection
a. “ebola outbreak west africa”
b. “guinea liberia sierra leone”
c. “cases ebola virus disease”
d. “public health workers”
e. “centers disease control prevention”
2. Distribution of sources (hostnames): a statistical summary of the various sources sampled in order to build the collection:
a. 18 (12.5%) web pages from blogs.plos.org,
b. 14 (9.7%) from cdc.gov, and
c. 11 (7.6%) from twitter.com
(Top 10 hosts fraction of collection: 50%)
3. Temporal distribution - Publication & Content: the publication dates and the dates mentioned in the content of a collection:
“From August 2014–December 2015, the guidance was accessed online...The guidance was retired on February 19, 2016, when more than 45 days had passed since Guinea was declared free of Ebola virus transmission, because widespread human-to-human transmission was at an end” Page last updated: December 27, 2017
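The distribution of sources can be sketched as a simple hostname tally. The function name `hostname_distribution`, the `top_k` parameter, and the stripping of a leading "www." are assumptions made for illustration, not details confirmed by the slides. The sample URIs are taken from the seed lists shown earlier.

```python
from collections import Counter
from urllib.parse import urlparse


def hostname_distribution(uris, top_k=10):
    """Return (ranked, top_fraction).

    ranked is a list of (hostname, count, fraction) sorted by count;
    top_fraction is the share of the collection covered by the top_k hosts.
    """
    hosts = []
    for uri in uris:
        host = urlparse(uri).hostname
        if not host:
            continue
        # Assumption: treat "www.cdc.gov" and "cdc.gov" as the same source.
        if host.startswith("www."):
            host = host[4:]
        hosts.append(host)
    total = len(hosts)
    counts = Counter(hosts)
    ranked = [(h, c, c / total) for h, c in counts.most_common()]
    top_fraction = sum(c for _, c, _ in ranked[:top_k]) / total if total else 0.0
    return ranked, top_fraction


sample_uris = [
    "http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/",
    "http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/",
    "http://blogs.plos.org/speakingofmedicine/2014/10/31/rapid-response-ebola/",
    "http://www.cdc.gov/mmwr/ebola_reports.html",
    "http://www.cdc.gov/media/DPK/2014/dpk-ebola-outbreak.html",
    "http://www.who.int/csr/disease/ebola/en/",
]
ranked, top2_fraction = hostname_distribution(sample_uris, top_k=2)
```

For this six-URI sample, `ranked` lists blogs.plos.org first with 3 pages (50%), and the top two hosts cover 5 of the 6 URIs.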
Quantifying textual diversity in a collection
4. Content diversity: a value between 0 and 1 indicating the degree of self-similarity of the text content of the collection
○ 0 - no diversity; duplicate documents
○ 1 - maximum diversity; documents without any common vocabulary
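One way to obtain a score with these two endpoints is the mean pairwise Jaccard distance between document vocabularies. This is only a minimal sketch: the slides do not state the exact formula, so Jaccard distance, the tokenizer, and the names `tokenize` and `content_diversity` are all assumptions.

```python
import re
from itertools import combinations


def tokenize(text):
    """Lowercased set of alphanumeric tokens (an assumed, simplistic tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def content_diversity(documents):
    """Mean pairwise Jaccard distance between document vocabularies.

    0.0 -> every pair shares an identical vocabulary (duplicates)
    1.0 -> no pair shares any vocabulary
    """
    vocabularies = [tokenize(d) for d in documents]
    pairs = list(combinations(vocabularies, 2))
    if not pairs:
        return 0.0
    distances = []
    for a, b in pairs:
        union = a | b
        similarity = len(a & b) / len(union) if union else 1.0
        distances.append(1.0 - similarity)
    return sum(distances) / len(distances)


duplicates = ["Harvey Puts Houston Underwater"] * 3
unrelated = [
    "Harvey Puts Houston Underwater",
    "Mass Shooting in Las Vegas",
    "Trump offers congratulations to Roy Moore",
]
related = [
    "Mass Shooting in Las Vegas",
    "Las Vegas shooting: What we know",
]

low = content_diversity(duplicates)
high = content_diversity(unrelated)
middle = content_diversity(related)
```

Duplicate titles score 0.0, titles with no shared words score 1.0, and partially overlapping titles fall in between. The scores will not necessarily match those reported in the talk, since the underlying measure may differ.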
Content diversity example (colors = collections, numbers = stories)
ID - news title - collection
1 - “Donald Trump Congratulates Roy Moore for Primary Win” - Roy Moore Wins
2 - “Trump offers congratulations to Roy Moore” - Roy Moore Wins
3 - “Roy Moore wins Alabama Senate GOP primary runoff” - Roy Moore Wins
4 - “Harvey Puts Houston Underwater” - Hurricane Harvey
5 - “Hurricane Harvey intensifies to Category 2 storm” - Hurricane Harvey
6 - “Harvey Puts Houston Underwater” - Hurricane Harvey
7 - “Mass Shooting in Las Vegas” - Vegas Shooting
8 - “Mass Shooting Outside Las Vegas’ Mandalay Bay” - Vegas Shooting
9 - “Las Vegas shooting: What we know” - Vegas Shooting
Diversity scores for story combinations (from figure): 1 1 1 = 0.00
Other combinations shown: 2 2 2; 3 3 3; 1 2 3 (three times); 1 4 7; 1 8 9; 1 2 3 4 5 6 7 8 9
Other scores shown: 0.39, 0.58, 0.30, 0.00, 0.00, 1.00, 1.00, 0.75
There are multiple ways of measuring URL diversity
5. Source diversity - URI, Domain, Hostname, and Social media: indicates whether a collection samples a single source, a handful of sources, or many sources.
http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html
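As a rough illustration of one such measure, the sketch below scores diversity as (unique items - 1) / (total items - 1), so that a single repeated source gives 0 and all-distinct sources give 1. This normalization is an assumption chosen for illustration. The names `diversity` and `source_diversity` are hypothetical, and domain-level diversity is left out because it would need a public-suffix list.

```python
from urllib.parse import urlparse


def diversity(items):
    """(unique - 1) / (total - 1): 0 for one repeated item, 1 if all are distinct."""
    items = list(items)
    if len(items) < 2:
        return 0.0
    return (len(set(items)) - 1) / (len(items) - 1)


def source_diversity(uris):
    """URI-level and hostname-level diversity for a list of URIs."""
    hostnames = [urlparse(u).hostname or "" for u in uris]
    return {"uri": diversity(uris), "hostname": diversity(hostnames)}


collection = [
    "http://www.cdc.gov/mmwr/ebola_reports.html",
    "http://www.cdc.gov/media/DPK/2014/dpk-ebola-outbreak.html",
    "http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf",
    "http://www.who.int/csr/disease/ebola/en/",
]
scores = source_diversity(collection)
```

Here every URI is distinct (URI diversity 1.0), but only two hostnames are sampled, so hostname diversity is 1/3. This mirrors how a collection can score high on one level and low on another.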
CCS: Approximating popularity and target audience
6. Collection exposure - Archival rate and Tweet index rate: approximates popularity
● Archival rate: fraction of archived URIs in collection
● Tweet index rate: fraction of URIs in collection found embedded in tweets
7. Target audience: approximates target audience of a collection with readability scores
grade level - title - source
7th - “History of hurricanes in Texas, by the numbers” - abcnews.go.com
11th - “Trump faces leadership test with Hurricane Harvey” - thehill.com
12th - “Harvey Puts Houston Underwater” - dailycaller.com
18th (graduate) - “Ebola virus entry requires the host-programmed recognition of an intracellular receptor” - nih.gov
20th (graduate) - “Virus taxonomy classification and nomenclature of viruses” - sciencemag.org
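The slides do not say which readability score produced these grade levels. As one possibility, the sketch below uses the Flesch-Kincaid grade level, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, with a deliberately naive syllable counter. The choice of formula and the function names are assumptions, and the numbers it yields should not be expected to match the grades listed above.

```python
import re


def count_syllables(word):
    """Naive estimate: number of vowel groups, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level of a text (0.0 for empty input)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


simple_grade = flesch_kincaid_grade("The cat sat. The dog ran.")
technical_grade = flesch_kincaid_grade(
    "Virus taxonomy classification and nomenclature of viruses"
)
```

Short sentences of one-syllable words score far lower than a dense technical title. In practice the score would be computed over the full text of each page rather than its title alone.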
Comparing collections with CCS
● We represented each collection as an n-dimensional vector of CCS values.
● Calculated distance between vectors.
CCS metrics | Archive-It Col. | Reddit Col.
Doc-Term content diversity | 0.86 | 0.89
List of entity set content diversity | 0.65 | 0.85
URI diversity | 1.00 | 0.98
Domain diversity | 0.34 | 0.50
Hostname diversity | 0.43 | 0.53
Social media rate | 0.07 | 0.12
Archival rate | 0.99 | 0.78
Tweet index rate | 0.72 | 0.40
Exposure rate (reading level) | 0.61 | 0.61
n-gram similarity of topic distribution | 1.00 | 0.70
Normalized Euclidean distance: 0.17
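The vector comparison can be sketched with the ten values from the table. The normalization used in the talk is not stated. The sketch assumes that every CCS value lies in [0, 1] and divides the Euclidean distance by its maximum possible value, sqrt(n), which is the same as taking the root mean squared difference. The function name `normalized_euclidean` is hypothetical.

```python
import math

# CCS values copied from the table (Archive-It collection vs. Reddit collection).
archive_it = [0.86, 0.65, 1.00, 0.34, 0.43, 0.07, 0.99, 0.72, 0.61, 1.00]
reddit = [0.89, 0.85, 0.98, 0.50, 0.53, 0.12, 0.78, 0.40, 0.61, 0.70]


def normalized_euclidean(u, v):
    """Euclidean distance divided by sqrt(n), so the result stays in [0, 1]
    when every component lies in [0, 1] (an assumed normalization)."""
    if len(u) != len(v) or not u:
        raise ValueError("vectors must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.sqrt(squared / len(u))


distance = normalized_euclidean(archive_it, reddit)
```

With the rounded table values this gives a distance of roughly 0.18, close to the 0.17 reported. The small difference may come from rounding in the table or from a different normalization.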
c. Summary of completed work
Summary of completed research (2016)
JCDL 2016: need for using local news sources to build collections for local events.
Summary of completed research (2017-2018)
HyperText 2018: Introduced Collection Characterizing Suite for characterizing and comparing collections
Summary of completed research (2017-2018)
JCDL 2018: Investigated discoverability of URIs of news stories on SERPs
Outline of work that informed this research
● Studying SERPs: A Supervised Learning Algorithm for Binary Domain Classification of Web Queries using SERPs (JCDL 2016 Poster)
● Interacting with Twitter
a. Extracting tweet conversations
b. Finding URLs on Twitter
● Extracting text from news documents
● Finding Storify stories
4. Proposed work
Schedule for pending research for 2018-2019 (timeline: 2018-06 to 2019-12)
Identify hubs & authorities in social media: 2018-06 - 2018-12
Candidacy proposal: 2018-06 - 2018-12
Implement seed generation system: 2019-01 - 2019-03
Evaluate seed generation system: 2019-04 - 2019-08
Dissertation/Defense: 2019-09 - 2019-12
5. Conclusions
● Archived collections offer a way of preserving the historic record of important events and begin with seeds.
● We propose exploiting micro-collections on social media to augment or bootstrap archived collections for stories and events.
○ Introduced the CCS for characterizing and comparing collections
○ We showed that micro-collections generated from social media are similar to Archive-It seeds
● Primary research tasks remaining:
○ Identify hubs and authorities in social media as a method to evaluate quality at scale
○ Investigate what makes “good” seeds and implement/evaluate seed generation system
@acnwala @webscidl
Thank you!

Capstone slidedeck for my capstone final edition.pdf
Capstone slidedeck for my capstone final edition.pdfCapstone slidedeck for my capstone final edition.pdf
Capstone slidedeck for my capstone final edition.pdf
eliklein8
 
Jual Obat Aborsi Kudus ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan Cy...
Jual Obat Aborsi Kudus ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan Cy...Jual Obat Aborsi Kudus ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan Cy...
Jual Obat Aborsi Kudus ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan Cy...
ZurliaSoop
 
Meet Incall & Out Escort Service in D -9634446618 | #escort Service in GTB Na...
Meet Incall & Out Escort Service in D -9634446618 | #escort Service in GTB Na...Meet Incall & Out Escort Service in D -9634446618 | #escort Service in GTB Na...
Meet Incall & Out Escort Service in D -9634446618 | #escort Service in GTB Na...
Heena Escort Service
 
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<
Health
 
Capstone slidedeck for my capstone project part 2.pdf
Capstone slidedeck for my capstone project part 2.pdfCapstone slidedeck for my capstone project part 2.pdf
Capstone slidedeck for my capstone project part 2.pdf
eliklein8
 

Dernier (17)

SEO Expert in USA - 5 Ways to Improve Your Local Ranking - Macaw Digital.pdf
SEO Expert in USA - 5 Ways to Improve Your Local Ranking - Macaw Digital.pdfSEO Expert in USA - 5 Ways to Improve Your Local Ranking - Macaw Digital.pdf
SEO Expert in USA - 5 Ways to Improve Your Local Ranking - Macaw Digital.pdf
 
💊💊 OBAT PENGGUGUR KANDUNGAN SEMARANG 087776-558899 ABORSI KLINIK SEMARANG
💊💊 OBAT PENGGUGUR KANDUNGAN SEMARANG 087776-558899 ABORSI KLINIK SEMARANG💊💊 OBAT PENGGUGUR KANDUNGAN SEMARANG 087776-558899 ABORSI KLINIK SEMARANG
💊💊 OBAT PENGGUGUR KANDUNGAN SEMARANG 087776-558899 ABORSI KLINIK SEMARANG
 
Marketing Plan - Social Media. The Sparks Foundation
Marketing Plan -  Social Media. The Sparks FoundationMarketing Plan -  Social Media. The Sparks Foundation
Marketing Plan - Social Media. The Sparks Foundation
 
JUAL PILL CYTOTEC PALOPO SULAWESI 087776558899 OBAT PENGGUGUR KANDUNGAN PALOP...
JUAL PILL CYTOTEC PALOPO SULAWESI 087776558899 OBAT PENGGUGUR KANDUNGAN PALOP...JUAL PILL CYTOTEC PALOPO SULAWESI 087776558899 OBAT PENGGUGUR KANDUNGAN PALOP...
JUAL PILL CYTOTEC PALOPO SULAWESI 087776558899 OBAT PENGGUGUR KANDUNGAN PALOP...
 
Content strategy : Content empire and cash in
Content strategy : Content empire and cash inContent strategy : Content empire and cash in
Content strategy : Content empire and cash in
 
Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...
Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...
Jual Obat Aborsi Palu ( Taiwan No.1 ) 085657271886 Obat Penggugur Kandungan C...
 
Enhancing Consumer Trust Through Strategic Content Marketing
Enhancing Consumer Trust Through Strategic Content MarketingEnhancing Consumer Trust Through Strategic Content Marketing
Enhancing Consumer Trust Through Strategic Content Marketing
 
Sociocosmos empowers you to go trendy on social media with a few clicks..pdf
Sociocosmos empowers you to go trendy on social media with a few clicks..pdfSociocosmos empowers you to go trendy on social media with a few clicks..pdf
Sociocosmos empowers you to go trendy on social media with a few clicks..pdf
 
Capstone slidedeck for my capstone final edition.pdf
Capstone slidedeck for my capstone final edition.pdfCapstone slidedeck for my capstone final edition.pdf
Capstone slidedeck for my capstone final edition.pdf
 
Jual Obat Aborsi Kudus ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan Cy...
Jual Obat Aborsi Kudus ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan Cy...Jual Obat Aborsi Kudus ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan Cy...
Jual Obat Aborsi Kudus ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan Cy...
 
BVG BEACH CLEANING PROJECTS- ORISSA , ANDAMAN, PORT BLAIR
BVG BEACH CLEANING PROJECTS- ORISSA , ANDAMAN, PORT BLAIRBVG BEACH CLEANING PROJECTS- ORISSA , ANDAMAN, PORT BLAIR
BVG BEACH CLEANING PROJECTS- ORISSA , ANDAMAN, PORT BLAIR
 
The Butterfly Effect
The Butterfly EffectThe Butterfly Effect
The Butterfly Effect
 
Meet Incall & Out Escort Service in D -9634446618 | #escort Service in GTB Na...
Meet Incall & Out Escort Service in D -9634446618 | #escort Service in GTB Na...Meet Incall & Out Escort Service in D -9634446618 | #escort Service in GTB Na...
Meet Incall & Out Escort Service in D -9634446618 | #escort Service in GTB Na...
 
Capstone slide deck on the TikTok revolution
Capstone slide deck on the TikTok revolutionCapstone slide deck on the TikTok revolution
Capstone slide deck on the TikTok revolution
 
Ignite Your Online Influence: Sociocosmos - Where Social Media Magic Happens
Ignite Your Online Influence: Sociocosmos - Where Social Media Magic HappensIgnite Your Online Influence: Sociocosmos - Where Social Media Magic Happens
Ignite Your Online Influence: Sociocosmos - Where Social Media Magic Happens
 
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<
+971565801893>> ORIGINAL CYTOTEC ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI<<
 
Capstone slidedeck for my capstone project part 2.pdf
Capstone slidedeck for my capstone project part 2.pdfCapstone slidedeck for my capstone project part 2.pdf
Capstone slidedeck for my capstone project part 2.pdf
 

Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media

  • 1. Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media. @acnwala • JCDL 2018 • June 3, 2018
  • 2. Alexander C. Nwala. Supervisor: Michael L. Nelson; Co-supervisor: Michele C. Weigle. Old Dominion University, Web Science & Digital Libraries Research Group (@acnwala • @WebSciDL). Joint Conference on Digital Libraries (JCDL) Doctoral Consortium, June 3, 2018, Fort Worth, TX. This work was made possible in part by IMLS LG-71-15-0077-15. Thank you SIGIR for the Travel Grant.
  • 3. Outline: 1. Introduction and Motivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions
  • 4. In March 2014, there was a serious outbreak of Ebola in West Africa. The outbreak severely affected Guinea, Liberia, and Sierra Leone, with about 11,000 deaths [1]. http://wayback.archive-it.org/4887/20141028153039/http://blogs.msf.org/en/staff/blogs/msf-ebola-blog [1] https://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/
  • 5. A few months after the Ebola outbreak, an archivist at the National Library of Medicine (NLM) collected seeds on Archive-It for the Ebola virus outbreak. http://wayback.archive-it.org/4887/20141022093244/http://blog.usaid.gov/ebola/
  • 6. 6 Archive-It Ebola virus seeds http://www.niaid.nih.gov/topics/ebolaMarburg/understandingEbola/ http://www.cdc.gov/vhf/ebola/pdf/facts-about-ebola-french.pdf http://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/previous-updates.html http://www.cdc.gov/vhf/ebola/french/2014-west-africa/previous-updates.html http://www.cdc.gov/vhf/ebola/french/2014-west-africa/index.html http://www.cdc.gov/vhf/ebola/exposure/monitoring-and-movement-of-persons-with-exposure.html http://www.cdc.gov/mmwr/ebola_reports.html http://www.cdc.gov/media/DPK/2014/dpk-ebola-outbreak.html http://www.acf.hhs.gov/programs/ohsepr/resource/ebola-planning-considerations http://healthycanadians.gc.ca/diseases-conditions-maladies-affections/disease-maladie/ebola/index-eng.php http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/whats-new.html http://espanol.cdc.gov/enes/vhf/ebola/outbreaks/2014-west-africa/index.html http://nypost.com/2014/10/29/ebola-doctor-lied-about-his-nyc-travels-police/ http://www.npr.org/blogs/goatsandsoda/2014/08/20/341869218/if-salt-n-pepa-told-you-to-brush-your-teeth-youd-surely-listen/ http://www.doctorswithoutborders.org/article/msf-protocols-staff-returning-ebola-affected-countries/ http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/ http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/ http://www.philly.com/philly/blogs/public_health/Yellow-fever-and-Ebola-similar-scourges-centuries-apart.html http://www.philly.com/philly/blogs/public_health/Syracuse-University-can-teach-us-a-lot-about-Ebola-panic.html http://www.philly.com/philly/blogs/public_health/Why-not-ban-travel-to-stop-Ebola.html http://www.pressherald.com/2014/10/17/fearing-ebola-strong-elementary-teacher-on-leave-after-traveling-to-dallas/ http://www.politico.com/magazine/story/2014/10/how-the-media-stoked-ebola-panic-112095.html 
http://www.washingtonpost.com/news/post-nation/wp/2014/10/27/nurse-detained-under-new-jerseys-ebola-quarantine-to-be-released/?hpid=z1 http://blogs.scientificamerican.com/doing-good-science/2014/10/31/ebola-abundant-caution-and-sharing-a-world/ http://www.scientificamerican.com/article/ebola-exacerbates-west-africa-s-poverty-crisis/?WT.mc_id=SA_WR_20141105 http://www.scientificamerican.com/article/let-s-talk-about-ebola-survivors-and-sex/?WT.mc_id=SA_WR_20141105 http://www.who.int/mediacentre/news/statements/2014/ebola-20140808/en/ http://federalsoup.com/articles/2014/10/31/army-to-set-up-ebola-testing-labs-in-liberia.aspx http://blogs.plos.org/speakingofmedicine/2014/10/22/ebola-taught-us-crucial-lesson-views-irrational-health-behaviors/ http://blogs.plos.org/speakingofmedicine/2014/10/31/social-pathways-ebola-virus-disease-rural-sierra-leone-implications-containment/ http://blogs.plos.org/speakingofmedicine/2014/10/31/rapid-response-ebola/ Sample of Archive-It Ebola virus seeds
  • 7. Archived web collections begin with seeds ● A seed list is an initial collection of exemplar web pages for a topic ○ seeds + linked pages form a collection when crawled ● Archived web collections consist of groups of web pages that share a common topic, e.g., “Ebola virus” and “2018 Winter Olympics.” ● Human-generated seeds are high-quality, but expensive to generate
  • 8. Archived web collections offer a way of preserving the historic record of important events. Examples: Mandela’s legacy (http://xhosaculture.co.za/), 2016 Dakota Access Pipeline (https://www.wsj.com/), 2018 Winter Olympics (http://www.nj.com/)
  • 9. Seeds may be generated by multiple users ● The Internet Archive and Archive-It (a service of the Internet Archive) have on multiple occasions requested that users submit seeds via Google Docs
  • 10. Sample seeds contributed for the Boston Marathon Bombing Collection
  • 11. Users on social media share stories that include hand-selected URIs ● The Wikipedia page about the Stoneman Douglas High School shooting was created the same day as the shooting event (February 14, 2018) ● We consider Wikipedia references an example of a micro-collection ● We propose extracting URIs from micro-collections such as Wikipedia references to generate seeds
  • 12. More micro-collections: extract URIs from Twitter Moments to generate seeds ● The Stoneman Douglas High School shooting Twitter Moment was created the day after the event
  • 13. Micro-collections often start early, before major events ● The Storify story “Protests In Kiev Turn Violent” was published in January 2014, before the major event: the Russian annexation of Crimea (started late February 2014)
  • 14. Micro-collections may include prelusive events lacking in collections triggered by major events ● The Archive-It collection for the event potentially omits some of the prelusive content in the Storify micro-collection ● The Storify story of the Ukrainian crisis (January 2014) highlights riots before the Russian annexation of Crimea (late February).
  • 15. We propose extracting URIs from social media micro-collections to bootstrap archived collections or augment curator-selected seeds.
  • 16. Sample of seed URIs for the Ebola virus topic, drawn from Archive-It seeds and from micro-collections (Reddit SERP and comments, Wikipedia references): http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/ http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/ https://www.cdc.gov/vhf/ebola/history/2014-2016-outbreak/index.html https://twiter.com/ebola_response/ https://www.who.int/mediacentre/news/ebola/en/ http://allafrica.com/stories/201407310957.html http://america.aljazeera.com/articles/2014/8/1/ebola-explainer.html http://jid.oxfordjournals.org/content/204/suppl_3/S785_long http://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0005804 https://www.youtube.com/watch?v=XasTcDsDfMg&feature=youtu.be https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074192 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4313106 http://apps.who.int/gho/data/view.ebola-sitrep.ebola-summary-latest http://www.who.int/csr/disease/ebola/en/ http://www.nature.com/articles/nature10348
  • 17. Taking the effort to create micro-collections is an indication of editorial effort, and thus presumably of the quality of the seeds. Figure: Wikipedia references for the Stoneman Douglas High School shooting
  • 18. Outline: 1. Introduction and Motivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions
  • 19. Primary research questions ● Research question 1: ○ Are seeds that are generated automatically from micro-collections in social media comparable to curator-generated seeds? ○ What quantitative method(s) can be used to compare collections? ● Research question 2: ○ If we consider curator hand-selected seeds the gold standard for collections, could this lead to a definition of what makes a collection good? ○ How do we assess the quality of collections at scale?
  • 20. Outline: 1. Introduction and Motivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions
  • 21. Generating seed URIs from social media ● We implemented a prototype system for generating seeds from the following social media sites: ○ Storify (out of service since May 16, 2018), ○ Twitter Moments, ○ Reddit, and ○ Wikipedia ● We also generated seeds from the Google SERP as a baseline for comparison with social media micro-collections, since we believe SERPs are a primary source for discovering seeds.
  • 22. Social media micro-collections were similar to Archive-It seeds ● Euclidean distance range between collections: 0.17 to
  • 23. Generating seeds from Storify ● Storify was a social media curation service that enabled users to create stories consisting of hand-selected web resources such as URIs of news articles, images, videos, etc. ○ Seeds = URIs in Storify stories ○ Unfortunately, Storify went out of service in May 2018 ○ http://ws-dl.blogspot.com/2017/08/2017-08-11-where-can-we-post-stories.html
  • 24. Generating seeds from Twitter Moments ● Twitter Moments is a service by Twitter that lets users create topical collections of tweets. ● Tweets in Twitter Moments embed URIs of news articles, images, videos, etc. ○ Seeds = URIs in Twitter Moments
  • 25. Generating seeds from Reddit ● Reddit is a service that allows users to post URIs for various topics. ● Reddit users rate the URIs and post comments that may also include URIs ○ Seeds = URIs in Reddit pages and comments
  • 26. Generating seeds from Wikipedia ● Wikipedia is a service that enables multiple contributors to create documents about various topics ranging from politics to science and technology ● The references of Wikipedia documents include URIs relevant to the document topic ○ Seeds = URIs in Wikipedia references. Figure: Wikipedia references for the Stoneman Douglas High School shooting
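The "Seeds = URIs in ..." rule that recurs across the Storify, Twitter Moments, Reddit, and Wikipedia slides can be illustrated with a minimal sketch. It assumes the micro-collection is already available as plain text; the function name `extract_seed_uris`, the regular expression, and the sample `references` string are illustrative and are not taken from the prototype described in the talk.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: an http(s) URI runs until whitespace or a bracket.
URI_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def extract_seed_uris(text):
    """Return the unique http(s) URIs found in text, in first-seen order."""
    seen = set()
    seeds = []
    for match in URI_PATTERN.findall(text):
        uri = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if urlsplit(uri).netloc and uri not in seen:
            seen.add(uri)
            seeds.append(uri)
    return seeds


# Made-up reference text that reuses two URIs quoted on the slides.
references = (
    "1. CDC overview: https://www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/. "
    "2. WHO page (http://www.who.int/csr/disease/ebola/en/) "
    "3. Repeat: http://www.who.int/csr/disease/ebola/en/"
)
seeds = extract_seed_uris(references)
print(seeds)
```

De-duplicating at extraction time keeps the later diversity and distribution measures from being skewed by URIs that are merely cited twice in the same micro-collection.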
  • 27. Not all micro-collections yield high-quality seeds: how do we recognize low-quality seeds at scale? ● Spam links in tweets ● Hijacked hashtags ● Can we assess the authority of a source (e.g., Infowars)?
  • 28. Seeds generated from SERPs can be used as a baseline to compare social media micro-collections ● Example of potential seeds generated from the Google SERP for the query “hurricane harvey”
  • 29. The probability of finding the URI of a news story diminishes with time ● Probability of finding the URI of the same story: daily 0.34 - 0.44, weekly 0.01 - 0.11, and monthly 0.01 - 0.08
  • 30. Outline: 1. Introduction and Motivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions
  • 31. Characterizing or comparing collections is a challenging task ● It requires comparing collections that may cater to different needs ● We explored foundational work in collection characterization from Library and Web Sciences ○ Defined a suite of 7 measures (Collection Characterizing Suite - CCS) ● The CCS is used to describe individual collections and compare multiple collections
  • 32. CCS: NLM Ebola virus collection example. 1. Distribution of topics: a ranked list of topics in a collection: a. “ebola outbreak west africa” b. “guinea liberia sierra leone” c. “cases ebola virus disease” d. “public health workers” e. “centers disease control prevention” 2. Distribution of sources (hostnames): a statistical summary of the sources sampled to build the collection: a. 18 (12.5%) web pages from blogs.plos.org, b. 14 (9.7%) from cdc.gov, and c. 11 (7.6%) from twitter.com (top 10 hosts' fraction of the collection: 50%) 3. Temporal distribution (publication and content): the dates found in a collection, e.g., “From August 2014–December 2015, the guidance was accessed online...The guidance was retired on February 19, 2016, when more than 45 days had passed since Guinea was declared free of Ebola virus transmission, because widespread human-to-human transmission was at an end” and “Page last updated: December 27, 2017”
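The distribution of sources described above can be sketched as a simple hostname count. This is an illustrative sketch, not the CCS implementation: the function name `source_distribution` is hypothetical, the four sample URIs are taken from the Archive-It seed list quoted earlier in the transcript, and hostnames are used exactly as they appear (so `www.cdc.gov` is not collapsed to `cdc.gov`).

```python
from collections import Counter
from urllib.parse import urlsplit


def source_distribution(uris, top=10):
    """Return (hostname, count, fraction of collection) for the top hosts."""
    hosts = Counter()
    for uri in uris:
        host = urlsplit(uri).hostname
        if host:  # skip strings that do not parse to a hostname
            hosts[host] += 1
    total = sum(hosts.values())
    return [(host, n, n / total) for host, n in hosts.most_common(top)]


sample_seeds = [
    "http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/",
    "http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/",
    "http://www.cdc.gov/mmwr/ebola_reports.html",
    "http://www.who.int/mediacentre/news/statements/2014/ebola-20140808/en/",
]
distribution = source_distribution(sample_seeds)
for host, count, fraction in distribution:
    print(f"{count} ({fraction:.1%}) from {host}")
```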
  • 33. Quantifying textual diversity in a collection. 4. Content diversity: a value between 0 and 1 indicating the degree of self-similarity of the text content of the collection ○ 0 - no diversity; duplicate documents ○ 1 - maximum diversity; documents without any common vocabulary
  • 34. Content diversity example. Roy Moore Wins: (1) “Donald Trump Congratulates Roy Moore for Primary Win”, (2) “Trump offers congratulations to Roy Moore”, (3) “Roy Moore wins Alabama Senate GOP primary runoff”. Hurricane Harvey: (4) “Harvey Puts Houston Underwater”, (5) “Hurricane Harvey intensifies to Category 2 storm”, (6) “Harvey Puts Houston Underwater”. Vegas Shooting: (7) “Mass Shooting in Las Vegas”, (8) “Mass Shooting Outside Las Vegas’ Mandalay Bay”, (9) “Las Vegas shooting: What we know”. Diversity scores shown for the example groupings of stories: 0.39, 0.58, 0.30, 0.00, 0.00, 0.00, 1.00, 1.00, 0.75
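One way to obtain a score with the stated end points (0 for duplicate documents, 1 for documents with no shared vocabulary) is one minus the mean pairwise cosine similarity of term-frequency vectors. The sketch below uses that formulation as an assumption; the slides do not state the exact formula, so it is not expected to reproduce the scores shown in the example, and the function names are illustrative.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Bag-of-words term frequencies over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def content_diversity(documents):
    """Assumed measure: 1 - mean pairwise cosine similarity, clamped to [0, 1]."""
    vectors = [term_vector(d) for d in documents]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # a single document has nothing to differ from
    mean_similarity = sum(cosine(a, b) for a, b in pairs) / len(pairs)
    return min(1.0, max(0.0, 1.0 - mean_similarity))


duplicates = ["Harvey Puts Houston Underwater"] * 3
disjoint = [
    "Trump offers congratulations to Roy Moore",
    "Harvey Puts Houston Underwater",
    "Mass Shooting in Las Vegas",
]
print(content_diversity(duplicates), content_diversity(disjoint))
```

The two sample groups exercise the end points only: three copies of one title, and three titles from the slide that share no words.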
  • 35. There are multiple ways of measuring URL diversity. 5. Source diversity (URI, domain, hostname, and social media): indicates whether a collection samples a single source, a handful of sources, or many sources. http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html
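As a rough illustration of source diversity, the sketch below scores a collection by its share of distinct items, scaled so that a single repeated source gives 0 and all-distinct sources give 1. The formula (unique - 1) / (total - 1) is an assumption made for this example; the slide only says that several measures exist. Domain-level diversity is left out because it needs a public-suffix list, and the helper names are hypothetical.

```python
from urllib.parse import urlsplit


def diversity(items):
    """Assumed measure: (unique - 1) / (total - 1); 0 for fewer than 2 items."""
    items = list(items)
    if len(items) < 2:
        return 0.0
    return (len(set(items)) - 1) / (len(items) - 1)


def uri_diversity(uris):
    """Diversity over full URIs."""
    return diversity(uris)


def hostname_diversity(uris):
    """Diversity over the hostname part of each URI."""
    return diversity(urlsplit(u).hostname for u in uris)


# Four URIs taken from the Archive-It seed list quoted in the transcript.
collection = [
    "http://blogs.plos.org/dnascience/2014/11/06/eman-reports-ebola-ground-zero/",
    "http://blogs.plos.org/globalhealth/2014/11/ebola_and_human_rights/",
    "http://www.cdc.gov/mmwr/ebola_reports.html",
    "http://www.who.int/mediacentre/news/statements/2014/ebola-20140808/en/",
]
print(uri_diversity(collection), hostname_diversity(collection))
```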
  • 36. CCS: Approximating popularity and target audience. 6. Collection exposure (archival rate and tweet index rate): approximates popularity ● Archival rate: fraction of archived URIs in the collection ● Tweet index rate: fraction of URIs in the collection found embedded in tweets 7. Target audience: approximates the target audience of a collection with readability scores (grade level - title - source): 7th - “History of hurricanes in Texas, by the numbers” - abcnews.go.com; 11th - “Trump faces leadership test with Hurricane Harvey” - thehill.com; 12th - “Harvey Puts Houston Underwater” - dailycaller.com; 18th (graduate) - “Ebola virus entry requires the host-programmed recognition of an intracellular receptor” - nih.gov; 20th (graduate) - “Virus taxonomy classification and nomenclature of viruses” - sciencemag.org
  • 37. Comparing collections with CCS ● We represented each collection as an n-dimensional vector of CCS values. ● Calculated the distance between vectors. CCS metrics (Archive-It collection / Reddit collection): Doc-Term content diversity 0.86 / 0.89; List of entity set content diversity 0.65 / 0.85; URI diversity 1.00 / 0.98; Domain diversity 0.34 / 0.50; Hostname diversity 0.43 / 0.53; Social media rate 0.07 / 0.12; Archival rate 0.99 / 0.78; Tweet index rate 0.72 / 0.40; Exposure rate (reading level) 0.61 / 0.61; n-gram similarity of topic distribution 1.00 / 0.70. Normalized Euclidean distance: 0.17
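The vector comparison on this slide can be sketched as follows, using the ten metric values listed on the slide. Two things are assumptions: that the first value of each pair belongs to the Archive-It collection and the second to the Reddit collection, and that "normalized" means dividing the Euclidean distance by its maximum possible value, sqrt(n), for vectors whose components lie in [0, 1]. With the rounded values from the slide this gives roughly 0.18, close to but not identical to the 0.17 reported, so the normalization actually used may differ.

```python
import math

# CCS values from the slide; column order (Archive-It, then Reddit) is assumed.
archive_it = [0.86, 0.65, 1.00, 0.34, 0.43, 0.07, 0.99, 0.72, 0.61, 1.00]
reddit = [0.89, 0.85, 0.98, 0.50, 0.53, 0.12, 0.78, 0.40, 0.61, 0.70]


def normalized_euclidean(u, v):
    """Euclidean distance divided by sqrt(n), the maximum for values in [0, 1]."""
    if len(u) != len(v) or not u:
        raise ValueError("vectors must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.sqrt(squared / len(u))


distance = normalized_euclidean(archive_it, reddit)
print(round(distance, 3))
```

A distance near 0 means the two collections have similar CCS profiles; identical vectors give exactly 0.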
  • 38. Outline: 1. Introduction and Motivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions
  • 39. Summary of completed research (2016) ● JCDL 2016: identified the need to use local news sources to build collections for local events.
  • 40. Summary of completed research (2017-2018) ● HyperText 2018: Introduced the Collection Characterizing Suite for characterizing and comparing collections
  • 41. Summary of completed research (2017-2018) ● JCDL 2018: Investigated the discoverability of URIs of news stories on SERPs
  • 42. Outline of work that informed this research ● Studying SERPs: A Supervised Learning Algorithm for Binary Domain Classification of Web Queries using SERPs (JCDL 2016 Poster) ● Interacting with Twitter: a. Extracting tweet conversations b. Finding URLs on Twitter ● Extracting text from news documents ● Finding Storify stories
  • 43. Outline: 1. Introduction and Motivation 2. Research questions 3. Preliminary work a. Bootstrapping Web Archive Collections from Social Media b. Collection Characterization c. Summary of completed work 4. Proposed work 5. Conclusions
  • 44. Schedule for pending research for 2018-2019 (timeline 2018-06 to 2019-12) ● Identify hubs & authorities in social media: 2018-06 - 2018-12 ● Candidacy proposal: 2018-06 - 2018-12 ● Implement seed generation system: 2019-01 - 2019-03 ● Evaluate seed generation system: 2019-04 - 2019-08 ● Dissertation/Defense: 2019-09 - 2019-12
  • 45. Conclusions ● Archived collections offer a way of preserving the historic record of important events, and they begin with seeds. ● We propose exploiting micro-collections on social media to augment or bootstrap archived collections for stories and events. ○ Introduced the CCS for characterizing and comparing collections ○ We showed that micro-collections generated from social media are similar to Archive-It seeds ● Primary research tasks remaining: ○ Identify hubs and authorities in social media as a method to evaluate quality at scale ○ Investigate what makes “good” seeds and implement/evaluate a seed generation system. Thank you! @acnwala @webscidl

Editor's notes

  1. https://twitter.com/tahDeetz/status/494886192299536385 https://twitter.com/SpaceCoastMetal/status/495589749051363328