1. Providing tools to build collections of stories for local events from local sources
2. This work was made possible in part by IMLS LG-71-15-0077-15 and
support from the Harvard Law School Library. We are grateful for the support.
Local Memory Project (LMP)
http://www.localmemory.org/, https://twitter.com/localmem
Alexander C. Nwala, Michele C. Weigle, and Michael L. Nelson
@webscidl
Old Dominion University
Adam B. Ziegler and Anastasia Aizman
@harvardlil
Harvard Library Innovation Lab
Presented by: Alexander C. Nwala (@acnwala)
Computer Science Ph.D. student
Media Cloud Intern, Berkman Klein Center for Internet & Society, Harvard University
JCDL 2017, June 21, 2017
3. LMP: Outline
1. Introduction
2. LMP local stories collection building
a. Geo: Nearby news media discovery
b. Chrome Extension: Collection building
c. Collection archiving
d. Community collection building
3. Evaluation
a. Dataset
b. Metrics/Results
4. Conclusions
4. Local Michigan media first reported on the Flint water changeover in 2014
http://www.mlive.com/opinion/flint/index.ssf/2014/04/editorial_switch_to_flint_rive.html
● April 2014: Officials in Flint, Michigan switched
the city’s water source from Lake Huron (Detroit
water system) to the Flint River
● This news was reported by local media such as
Michigan Radio, the Flint Journal-MLive, and
local TV affiliates in Flint (WEYI, WJRT, WSMH,
and WNEM)1
1 Denise Robbins. 2016. ANALYSIS: How Michigan And National Reporters Covered The Flint Water Crisis. https://mediamatters.org/research/2016/02/02/analysis-how-michigan-and-national-reporters-co/208290. (2016).
5. http://www.mlive.com/news/flint/index.ssf/2014/05/state_says_flint_river_water_m.html
● May 23, 2014: City residents complained about
the water’s taste and smell
● This was reported by Ron Fonger of the Flint
Journal-MLive (local media)2
2 Ron Fonger. 2014. State says Flint River water meets all standards but more than twice the hardness of lake water. http://www.mlive.com/news/flint/index.ssf/2014/05/state_says_flint_river_water_m.html. (2014).
City residents complained about the water’s taste and smell
6. Between August and September 2014: the city issued three boil advisories to residents of Flint
after finding fecal coliform bacteria (E. coli) in the water1
http://www.mlive.com/news/flint/index.ssf/2014/09/flint_says_drinking_water_advi.html
http://www.mlive.com/news/flint/index.ssf/2014/09/flint_lifts_boil_water_advisor.html
http://www.mlive.com/news/flint/index.ssf/2014/09/flint_flushes_out_latest_water.html
Flint issues three boil advisories after finding E. coli in the water
1 Denise Robbins. 2016. ANALYSIS: How Michigan And National Reporters Covered The Flint Water Crisis. https://mediamatters.org/research/2016/02/02/analysis-how-michigan-and-national-reporters-co/208290. (2016).
7. January 5, 2016: Governor Rick Snyder declared a state of emergency for the city of Flint, due to
dangerously high levels of lead contamination in the drinking water
https://www.democracynow.org/2016/1/8/poisoned_democracy_how_an_unelected_official
January 2016, Governor Rick Snyder declared a state of emergency for Flint
8. ● A chain of events about the Flint water crisis was reported by local media, but most of the non-local media did
not report this crucial story until 2016.1
● Local media is fundamental to journalism, but is in decline.3
LMP attempts to shed some light on local media
1 Denise Robbins. 2016. ANALYSIS: How Michigan And National Reporters Covered The Flint Water Crisis. https://mediamatters.org/research/2016/02/02/analysis-how-michigan-and-national-reporters-co/208290. (2016).
3 Rasmus Kleis Nielsen. 2015. Local journalism: the decline of newspapers and the rise of digital media. IB Tauris.
Non-local media did not report this crucial story until 2016
https://cloudfront.mediamatters.org/static/uploader/image/2016/02/03/flinttimeline1.png
9. Local and non-Local media have different priorities
Non-local news organizations such as CNN cover
stories of a broader (national/international) scope such
as Obamacare and the Syrian refugee migrant crisis
Local media such as the Caloosa Belle Newspaper
(LaBelle, FL) cover stories that would not naturally
be of interest to another locality, such as the annual
Swamp Cabbage Festival
http://caloosabelle.com/?s=swamp+cabbage http://www.cnn.com/specials/world/migration-crisis
10. LMP: Introduction
LMP provides a suite of tools (beginning with two) to help users and
small communities discover, collect, build, archive, and share collections
of stories for important local events by leveraging local news sources
12. Geo: Nearby news media discovery
● Given a zip code, Geo returns a list of newspapers, TV stations, and radio stations
in order of proximity to the location associated with the zip code.
● For example, given the zip code “23529” (Norfolk, Virginia, USA), here is a
list of 10 news media for Norfolk:
13. Geo: Nearby news media discovery
● For example, given the zip code “23529” (Norfolk, Virginia, USA), here is a
list of 10 news media for Norfolk (JSON):
14. Geo: Nearby news media discovery
● US local news repository:
○ 5,992 Newspapers
○ 1,061 TV stations, and
○ 2,539 Radio stations
■ Scraped from
http://www.usnpl.com/
● Non-US local news repository:
○ 6,638 Newspapers
○ 183 Countries
○ 3,151 Cities
■ Scraped from
https://www.thepaperboy.com/
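The proximity ordering that Geo performs can be illustrated with a minimal sketch. It assumes each media record carries a latitude and longitude and that the zip code has already been resolved to a coordinate pair; the record fields, the sample names, and the coordinates below are hypothetical and are not taken from the LMP repository.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_media(origin, media, limit=10):
    """Return up to `limit` media records sorted by distance from `origin`."""
    lat, lon = origin
    ranked = sorted(
        media,
        key=lambda m: haversine_km(lat, lon, m["lat"], m["lon"]),
    )
    return ranked[:limit]


# Hypothetical records with approximate coordinates, for illustration only.
media = [
    {"name": "Station A", "type": "tv", "lat": 37.54, "lon": -77.44},
    {"name": "Paper B", "type": "newspaper", "lat": 36.85, "lon": -76.29},
    {"name": "Radio C", "type": "radio", "lat": 37.03, "lon": -76.35},
]
zip_centroid = (36.89, -76.31)  # assumed coordinates for zip code 23529
ranked = nearest_media(zip_centroid, media)
print([m["name"] for m in ranked])
```

Sorting on great-circle distance keeps the sketch independent of any geocoding service; how Geo itself resolves zip codes and measures proximity is not stated on the slides.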
15. Chrome Extension: Collection building
SearchEngine(q = “protesters and police site:whro.org”)
...
SearchEngine(q = “protesters and police site:pilotonline.com”)
SearchEngine(q = “protesters and police site:wtkr.com”)
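The site-restricted queries above follow a simple pattern: the user's query is combined with a `site:` operator for each local news domain, and the results are merged into one collection. The sketch below shows that pattern under assumptions; `site_queries`, `build_collection`, and the `fake_search` stub are hypothetical names, and the stub stands in for whatever search engine the extension actually queries.

```python
def site_queries(query, domains):
    """Build one site-restricted search query per local news domain."""
    return [f"{query} site:{domain}" for domain in domains]


def build_collection(query, domains, search):
    """Run every site query through `search` and merge the links in order,
    dropping duplicates. `search` is any callable mapping a query to links."""
    seen = set()
    collection = []
    for q in site_queries(query, domains):
        for link in search(q):
            if link not in seen:
                seen.add(link)
                collection.append(link)
    return collection


def fake_search(q):
    """Hypothetical stand-in for a search engine: one made-up link per site."""
    domain = q.rsplit("site:", 1)[1]
    return [f"http://{domain}/story-1"]


domains = ["whro.org", "pilotonline.com", "wtkr.com"]
queries = site_queries("protesters and police", domains)
collection = build_collection("protesters and police", domains, fake_search)
print(queries)
print(collection)
```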
16. Local Stories for Query: "protesters and police", for
23529 (Norfolk VA, USA).
Chrome Extension: Collection building
17. Chrome Extension: Collection building
Local vs. non-Local
Local collection: Local news sources from Virginia, such as The Virginian-Pilot,
WHRO-TV, and WTKR-TV
Non-Local collection: non-Local sources (e.g., CNN and NBC News),
Local sources (e.g., ABC7 Chicago and Chicago Tribune), and a YouTube source
A non-Local collection mixes Local and non-Local sources
18. To mitigate the problems of content drift and link rot, as well as preserve
collections for future users and researchers, the LMP extension
implements collection archiving
21. Community collection building
● We believe there is value when multiple users contribute to the same collection
● This is similar in spirit to the Internet Archive’s request to the public to contribute URIs for
the 2016 Orlando Nightclub Shooting Web Archive:
22. The LMP Extension enables users to share collections on Twitter
● We believe there is value when multiple users contribute to the
same collection
● The LMP Extension enables users to share collections on Twitter.
Shared collections may be tagged with a hashtag
● The hashtag provides a means for thematically-related collections to
be organized
23. The hashtag provides a means for thematically-related collections to be organized
25. Evaluation
● We claim that Local collections have less exposure compared to
non-Local collections
● Through collection building, archiving, and sharing, LMP could
help increase the exposure of Local news sources
● To assess the validity of our claim, we measured the degree of
exposure Local collections have compared to non-Local collections
26. Evaluation: Dataset
● Our evaluation dataset consisted of 20 pairs (Local and non-Local) of collections
corresponding to 20 different stories
● Each collection (Local and non-Local) was further split into two classes:
○ G - extracted from the default Google SERP, and
○ NV - extracted from the Google News vertical SERP
29. Evaluation: Metrics
● For each collection we measured:
○ Archival coverage and tweet index rate to approximate the
exposure of the Local and non-Local collections
● We also measured:
○ Temporal range,
○ Precision, and
○ Sub-collection overlap for experimentation
30. Archival coverage: Non-Local collections produced higher archive rates
than Local collections (claim confirmed)
● Definition: The archival coverage is the fraction of a
collection that is archived
● Claim: We claim that non-Local collections possess
higher archive rates than Local collections
● Extraction: The binary archived state of a story in a
collection was extracted by utilizing the MemGator
utility (http://memgator.cs.odu.edu/)
● Result:
○ Non-Local collections G and NV produced
archive rates of 0.83 and 0.80, respectively
○ Local collections G and NV produced archive
rates of 0.52 and 0.63, respectively
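Since archival coverage is defined as the fraction of a collection that is archived, it reduces to a simple ratio once each story has a binary archived state. The sketch below assumes those states have already been obtained (for example from a MemGator lookup) and uses made-up URLs and values; `coverage` is a hypothetical helper name.

```python
def coverage(states):
    """Fraction of stories in a collection whose binary state is True.

    `states` maps each story URL to True (e.g. archived) or False.
    """
    if not states:
        raise ValueError("collection is empty")
    return sum(1 for state in states.values() if state) / len(states)


# Hypothetical archived states for a four-story collection.
archived = {
    "http://example.com/story-1": True,
    "http://example.com/story-2": True,
    "http://example.com/story-3": False,
    "http://example.com/story-4": True,
}
rate = coverage(archived)
print(rate)
```

The same ratio applies to any other per-story binary state, so a function like this could be reused wherever a collection-level rate is needed.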
31. Tweet index rates: Non-Local collections produced higher tweet index
rates than Local collections (claim confirmed)
● Definition: The tweet index rate is the fraction of a
collection which could also be found embedded in
a tweet
● Claim: We claim that non-Local collections possess
higher tweet index rates than Local collections
● Extraction: The binary tweet index state of a story
in the collection was extracted by searching Twitter
● Result:
○ Non-Local collections G and NV produced
tweet index rates of 0.71 and 0.80,
respectively
○ Local collections G and NV produced tweet
index rates of 0.44 and 0.59, respectively
32. Temporal range: Non-Local-NV collections had the highest probability (0.75)
of producing the newest document (claim confirmed)
● Definition: The temporal range of a collection is the
distribution of the creation datestamps of the stories in
the collection
● Claim: We claim that non-Local collections are
temporally biased to produce newer stories than Local
collections
● Extraction: Most news stories have creation
datestamps. We extracted these datestamps from the
SERPs
● Result:
○ Local-G collections produced the oldest
documents with a probability of 0.7
○ These probabilities have a practical consequence:
one must sample Local-G collections in
order to maximize the chances of finding the first
reports about a story or event
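One plausible way to arrive at a probability such as "Local-G produces the oldest document" is to check, story by story, which collection type holds the earliest datestamp and then take the fraction of stories each type wins. The slides do not spell out the calculation, so the sketch below is an assumption; the function name and the datestamps are hypothetical.

```python
from collections import Counter
from datetime import date


def oldest_document_probability(stories):
    """Fraction of stories for which each collection type holds the oldest document.

    `stories` is a list with one dict per story, mapping a collection type
    (e.g. "Local-G") to the creation dates of the documents in that collection.
    """
    if not stories:
        raise ValueError("no stories to evaluate")
    wins = Counter()
    for by_type in stories:
        # The type whose earliest datestamp is the earliest overall wins the story.
        winner = min(by_type, key=lambda t: min(by_type[t]))
        wins[winner] += 1
    return {t: n / len(stories) for t, n in wins.items()}


# Hypothetical datestamps for two stories, for illustration only.
stories = [
    {
        "Local-G": [date(2014, 4, 25), date(2015, 1, 10)],
        "non-Local-NV": [date(2016, 1, 5)],
    },
    {
        "Local-G": [date(2016, 6, 14)],
        "non-Local-NV": [date(2016, 6, 13)],
    },
]
probabilities = oldest_document_probability(stories)
print(probabilities)
```

Replacing `min` with `max` in both places would give the corresponding estimate for the newest document.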
33. Precision: Type-G collections produce documents at a higher precision
than NV (claim partially confirmed)
● Definition: The precision of a collection is the fraction of
stories in the collection that are relevant to the collection
query based on the judgement of a human evaluator. We
labeled a story relevant or non-relevant only if one label
led the other by a margin of 2 votes or more
● Claim: We claim that non-Local collections possess a
higher precision than Local collections
● Extraction: 14 evaluators evaluated our dataset. For each
story in a collection, an evaluator scored the story as
relevant if the story was on topic with respect to the
collection query, and non-relevant otherwise
● Result
○ Local-G precision: 0.84, non-Local-G: 0.72,
Local-NV: 0.71, and non-Local-NV: 0.68
Relevance Margin of 2 votes or more
34. Precision: Type-G collections produce documents at a higher precision
than NV (claim partially confirmed)
● Result
○ non-Local-G precision: 0.84, Local-G: 0.79,
non-Local-NV: 0.71, and Local-NV: 0.70
Relevance Margin of 1 Vote or more
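The effect of the relevance margin on precision can be sketched as follows. The sketch assumes each story has a count of relevant and non-relevant votes, and that stories whose vote difference falls below the margin are left undecided and excluded from the denominator; that exclusion rule is an assumption, as are the function name and the vote counts.

```python
def precision(votes, margin=2):
    """Precision of a collection under a vote-margin rule.

    `votes` is a list of (relevant_votes, non_relevant_votes) pairs, one per
    story. A story is labeled only when one side leads by at least `margin`.
    """
    relevant = non_relevant = 0
    for rel, non in votes:
        if rel - non >= margin:
            relevant += 1
        elif non - rel >= margin:
            non_relevant += 1
        # Otherwise the story stays undecided and is not counted.
    decided = relevant + non_relevant
    return relevant / decided if decided else None


# Hypothetical vote counts for a four-story collection.
votes = [(3, 0), (2, 1), (0, 3), (3, 1)]
p_margin_2 = precision(votes, margin=2)
p_margin_1 = precision(votes, margin=1)
print(p_margin_2, p_margin_1)
```

With these made-up votes, the story scored (2, 1) is undecided at a margin of 2 but counts as relevant at a margin of 1, which is why the two settings can report different precision values for the same collection.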
35. Sub-collection overlap: Local collections showed a higher overlap rate
than non-Local collections (claim confirmed)
● Definition:
○ Given a collection evaluation dataset, let sub-collection sets L_G and L_NV
be the sets populated from Local-G and Local-NV, respectively
○ Similarly, let sub-collection sets NL_G and NL_NV be the sets populated
from non-Local-G and non-Local-NV, respectively
○ The overlap of 2 sets X, Y, overlap(X, Y) =
● Claim: We claim Local sub-collections L_G and L_NV have more in common
(more overlap) than non-Local sub-collections NL_G and NL_NV
● Result: Local collections showed a higher overlap rate than
non-Local collections
e1: Local collections overlap
e2: Non-Local collections overlap
e3: e1 and e2 overlap
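To make the overlap comparison concrete, here is a minimal sketch that scores two pairs of sub-collections. The formula on the slide did not survive in this transcript, so the sketch substitutes the overlap coefficient, |X ∩ Y| / min(|X|, |Y|), purely as an assumption; the Jaccard index would be another reasonable stand-in. The URIs are made up.

```python
def overlap(x, y):
    """Overlap coefficient of two collections: |X & Y| / min(|X|, |Y|).

    This measure is an assumed stand-in, not a formula confirmed by the slides.
    """
    x, y = set(x), set(y)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


# Hypothetical sub-collections of story URIs.
l_g = {"local/a", "local/b", "local/c", "local/d"}
l_nv = {"local/c", "local/d", "local/e"}
nl_g = {"nonlocal/p", "nonlocal/q", "nonlocal/r", "nonlocal/s"}
nl_nv = {"nonlocal/s", "nonlocal/t", "nonlocal/u", "nonlocal/v"}

local_overlap = overlap(l_g, l_nv)
non_local_overlap = overlap(nl_g, nl_nv)
print(local_overlap, non_local_overlap)
```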
37. Conclusions
● We cannot rely exclusively on non-Local sources to build our
collections
● Local news sources are fundamental to journalism, but less exposed
● LMP’s tools could help expose local news sources
○ Geo (http://www.localmemory.org/geo/)
○ Chrome Extension - Local stories collection generator
(http://www.localmemory.org/)
● Our tools, local news repository, and evaluation results are publicly
available (https://github.com/harvard-lil/local-memory)