JSTOR Labs is developing a new tool to help baseball researchers by building a knowledge graph connecting people, places, organizations, and events mentioned in a collection of baseball-related articles and data sources. They conducted a flash build at the Library of Congress to develop an initial prototype by extracting entities from articles, linking them to Wikidata records about baseball, and incorporating additional data sources. The tool will be released as a proof of concept on labs.jstor.org and has potential to be expanded by adding more data, correcting errors through crowdsourcing, and expanding beyond baseball in the future.
Cultural History Baseball Cards: Flash-building a New Tool for Baseball Researchers
1. Cultural History Baseball Cards: Flash-building a New Tool for Baseball Researchers
Alex Humphreys
@abhumphreys
July 13, 2018
2. ITHAKA is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.
JSTOR is a not-for-profit digital library of academic journals, books, and primary sources.
Ithaka S+R is a not-for-profit research and consulting service that helps academic, cultural, and publishing communities thrive in the digital environment.
Portico is a not-for-profit preservation service for digital publications, including electronic journals, books, and historical collections.
Artstor provides 2+ million high-quality images and digital asset management software to enhance scholarship and teaching.
3. JSTOR Labs works with partner publishers, libraries and labs to create tools for researchers, teachers and students that are immediately useful – and a little bit magical.
16. OUR APPROACH:
1. Create the sandbox
2. Research & data prep
3. Design jam
4. Select an approach
5. Refine approach
6. Release
July 9-13 Flash build at Library of Congress
18.
• Selected articles in JSTOR about baseball (~25k) using a topic model
• Ran entity extraction (people, places, organizations)
• Limited entities to Wikidata records associated with baseball
• Created a knowledge graph
• Incorporated data from Wikidata, LOC and NMAAHC into the knowledge graph
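The data-prep steps above can be sketched in a few lines of Python. This is a minimal illustration, not the JSTOR Labs implementation: the sample mentions, the placeholder identifiers (`Q_ROBINSON`, `Q_DODGERS`, and so on, which are not real Wikidata QIDs), the `BASEBALL_IDS` allowlist, and the `build_graph` helper are all assumptions made for the example. It assumes entity extraction and Wikidata linking have already run, keeps only entities on the baseball allowlist, and connects entities that appear in the same article.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical extractor output: article id -> list of mentions.
# Each mention is (label, entity type, linked id or None if unlinked).
# The ids are placeholders, NOT real Wikidata QIDs.
ARTICLE_MENTIONS = {
    "article-1": [
        ("Jackie Robinson", "person", "Q_ROBINSON"),
        ("Brooklyn Dodgers", "organization", "Q_DODGERS"),
        ("Brooklyn", "place", "Q_BROOKLYN"),
        ("the commissioner", "person", None),
    ],
    "article-2": [
        ("Jackie Robinson", "person", "Q_ROBINSON"),
        ("Brooklyn Dodgers", "organization", "Q_DODGERS"),
        ("Ebbets Field", "place", "Q_EBBETS"),
    ],
    "article-3": [
        ("Ebbets Field", "place", "Q_EBBETS"),
        ("Brooklyn Dodgers", "organization", "Q_DODGERS"),
    ],
}

# Hypothetical allowlist standing in for "Wikidata records associated with baseball".
BASEBALL_IDS = {"Q_ROBINSON", "Q_DODGERS", "Q_EBBETS"}


def build_graph(article_mentions, allowed_ids):
    """Build a co-occurrence graph from linked entity mentions.

    Returns (nodes, articles_by_entity, edge_weights), where an edge weight
    is the number of articles in which both entities appear.
    """
    nodes = {}
    articles_by_entity = defaultdict(set)
    edge_weights = defaultdict(int)

    for article_id, mentions in article_mentions.items():
        # Keep only linked entities that are on the allowlist.
        kept = {
            qid: (label, kind)
            for label, kind, qid in mentions
            if qid in allowed_ids
        }
        for qid, (label, kind) in kept.items():
            nodes.setdefault(qid, {"label": label, "type": kind})
            articles_by_entity[qid].add(article_id)
        # Connect every pair of entities mentioned in the same article.
        for a, b in combinations(sorted(kept), 2):
            edge_weights[(a, b)] += 1

    return nodes, dict(articles_by_entity), dict(edge_weights)


nodes, articles, edges = build_graph(ARTICLE_MENTIONS, BASEBALL_IDS)
for (a, b), weight in sorted(edges.items()):
    print(nodes[a]["label"], "<->", nodes[b]["label"], "articles:", weight)
```

Counting shared articles is only one plausible way to weight a link. Data from additional sources could be merged in by adding attributes to the same node records, keyed by the shared identifier.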
36.
In the next few weeks:
• Bug fixing, etc.
• Release as a proof of concept, available on labs.jstor.org
Further down the road, there’s opportunity to:
• Add new data & content to the knowledge graph
• Incorporate crowd-sourcing to correct errors in knowledge graph
• Expand beyond baseball