This document summarizes JSTOR's efforts to develop a sustainability collection and utilize their controlled vocabulary thesaurus (JTHES) to semantically index and organize content in the collection. Subject matter experts were enlisted to review and provide feedback on key sustainability terms. A prototype sustainability portal was created that uses the thesaurus and semantic indexing to power auto-suggest, discovery, and refinement of results. Refinements are ongoing, including automating the calculation of documents' sustainability scores and topic labeling.
2. Overview
Sustainability collection defined
Utilization of the thesaurus within the sustainability collection
Subject matter experts enlisted
Results
Live demo
3. JSTOR: a quick primer
3,200+ journals & 30,000+ books
9.3 million full length articles
70 million pages
2.9 million book reviews
138 million content accesses in 2013
100 million searches per year
http://www.jstor.org/
4. Sustainability Collection: what will it be?
Driver: an emerging interdisciplinary area whose research and teaching needs JSTOR wanted to support.
Core topics: Cities and Urbanization, Food and Agriculture, Industrial Ecology, Resource Economics, Forestry and Land Use, and Environmental Policy and Law
Composed of journals, books, and grey literature (working reports, research reports, technical reports, etc.)
Specialized functionality to support research, including semantic indexing to help researchers locate related terms and concepts. This is where the JSTOR Thesaurus (JTHES) comes into play!
7. The challenge
To assemble a list of key terms in Sustainability
The terms will be used to organize and tag sustainability-related research articles on JSTOR starting in 2015.
These terms will also be used for an autocomplete function in the search component.
Utilize the JTHES in a live prototype
This was the first project where we looked at how to use the thesaurus as an
intelligence layer within a collection. How should it work? How do we do this?
8. How do we get this done? The options…
Create a new thesaurus for sustainability:
Pros: Specific to sustainability
Cons: Changes must be made in more than one place; cost of creating and maintaining a separate thesaurus
Create a sustainability branch within JTHES:
Pros: Could use BTs (broader terms) to pull all relevant branches and terms from elsewhere in the JTHES into one branch
Cons: Redundant; multiple BTs clutter up the JTHES
Create a facet to tag terms within JTHES as “Sustainability”:
Pros: Creates a flat list (in faceted view) of all of the terms in that facet; easy to maintain
Cons: Does not show a hierarchy; cannot have multiple facets
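A minimal sketch of what the facet option amounts to as a data structure, assuming each term record carries at most one facet tag. The `Term` class, the `terms_in_facet` helper, and the broader terms given for Carbon and Green buildings are hypothetical illustrations, not the JTHES data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    """Hypothetical thesaurus term record (not the real JTHES schema)."""
    label: str
    broader: List[str] = field(default_factory=list)  # BTs: where the term sits in the hierarchy
    facet: Optional[str] = None                       # a single facet tag, or None


def terms_in_facet(terms, facet):
    """Return the flat, alphabetical list of labels tagged with the facet."""
    return sorted(t.label for t in terms if t.facet == facet)


terms = [
    Term("Sustainable design",
         ["Sustainability science", "Design engineering", "Sustainable engineering"],
         "Sustainability"),
    Term("Green buildings", ["Sustainable architecture"], "Sustainability"),
    Term("Carbon", ["Chemical elements"]),  # ambiguous term, left untagged
]

flat_list = terms_in_facet(terms, "Sustainability")
```

The result is a flat list: the hierarchy stays in `broader`, and the facet view does not show it, which matches the pro and the con listed above.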
9. The road to sustainability…
Research: examined existing glossaries and thesauri created by research libraries, discipline associations and individual scholars in each of the disciplines.
Existing terms (pulling lists)
Existing branches (clean up)
Adding new terms
Adding new branches: Food studies, Urban studies, etc.
Constructing new rules and refining existing rules
Testing content
Testing content
10. Enlisting Subject matter experts
Contacted faculty members in ten disciplines to go over the subset of terms assembled in their discipline and review those terms with an eye toward:
Is this how people in the field express this concept?
Is it correctly included in the sustainability facet?
Are there any important terms or concepts that we've missed? (including acronyms, synonyms, variant spellings, inverted phrases)
11. SME spreadsheets
Each SME approached their subject area slightly differently: some were reluctant to give much feedback, while others gave large amounts of feedback to sift through.
Example of terms pulled from Law, Public administration/policy and International/global studies
14. Implementation of the Sustainability Prototype
The thesaurus and semantic index are used for content discovery and presentation
The identification of a “sustainability collection” from the JSTOR corpus was performed using topic modeling, specifically latent Dirichlet allocation (LDA)
A model of 100 topics was generated from the content
Staff assigned sustainability scores for each of the topics based on a review of the top words in each topic
Each document in the JSTOR corpus was then assigned a sustainability score of 0-9 based on the sustainability scores for the topics most closely associated with the document
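The slides do not say how the topic scores were combined into a document score. As one possible reading, the sketch below takes a weighted average of the staff-assigned scores over a document's strongest topics. The function name, the `top_n=3` cutoff, and the sample weights are assumptions for illustration only.

```python
def document_sustainability_score(doc_topic_weights, topic_scores, top_n=3):
    """Return an integer 0-9 score for one document.

    doc_topic_weights: {topic_id: weight} for the document, as produced by LDA.
    topic_scores:      {topic_id: staff-assigned sustainability score, 0-9}.
    top_n:             how many of the strongest topics to consider (assumed).
    """
    # Keep only the topics most closely associated with the document.
    top = sorted(doc_topic_weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    total = sum(weight for _, weight in top)
    if total == 0:
        return 0
    # Weighted average of the topic scores, weighted by topic strength.
    weighted = sum(weight * topic_scores.get(topic, 0) for topic, weight in top) / total
    return int(round(weighted))


# Hypothetical document dominated by a topic that staff scored 9.
score = document_sustainability_score({7: 0.6, 12: 0.3, 40: 0.1}, {7: 9, 12: 5, 40: 0})  # 7
```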
15. Weighting of document-level indexed terms
Document-level weights were computed for each semantic term using TF-IDF
TF-IDF is a measure of how important a word is to a document in a collection
The TF-IDF value increases proportionally to the number of times the word appears in a document (the ‘TF’ or term frequency), but is offset by how common the word is in a corpus (the ‘IDF’ or inverse document frequency)
The TF-IDF weighted terms are used to:
order the terms displayed for each document
boost document relevancy when index terms are used in discovery
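A minimal sketch of the TF-IDF weighting described above, using one common variant (relative term frequency times the natural log of inverse document frequency). The slides do not state which variant was used; the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: {doc_id: list of index terms assigned to the document}.

    Returns {doc_id: {term: weight}}.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))

    weights = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        total = len(terms)
        weights[doc_id] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


def ranked_terms(doc_weights):
    """Order a document's terms for display, highest weight first."""
    return sorted(doc_weights, key=doc_weights.get, reverse=True)


docs = {
    "a": ["recycling", "recycling", "policy"],
    "b": ["policy", "forestry"],
    "c": ["policy", "urban planning"],
}
w = tf_idf(docs)
display_order = ranked_terms(w["a"])  # ["recycling", "policy"]
```

In this sample "policy" appears in every document, so its IDF is log(1) = 0 and it drops to the bottom of each document's term list, which is the offsetting effect the slide describes.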
16. Auto-suggest and refining results
[Thesaurus slide: a new thing, metadata we create, screenshot(s) of
Sustainability Portal]
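The portal's auto-suggest implementation is not described in the slides. The sketch below shows one simple way thesaurus terms and their variants (acronyms, synonyms) could drive prefix suggestions; it is an assumption, not the prototype's code. The `AutoSuggest` class and the sample entries are hypothetical.

```python
from bisect import bisect_left


class AutoSuggest:
    """Prefix lookup over thesaurus labels, returning preferred terms."""

    def __init__(self, entries):
        # entries: {label (preferred or variant): preferred term}
        self._keys = sorted((label.lower(), preferred) for label, preferred in entries.items())

    def suggest(self, prefix, limit=5):
        prefix = prefix.lower()
        # Binary search for the first label that is >= the prefix.
        i = bisect_left(self._keys, (prefix,))
        out = []
        while i < len(self._keys) and self._keys[i][0].startswith(prefix):
            preferred = self._keys[i][1]
            if preferred not in out:  # several variants may map to one term
                out.append(preferred)
                if len(out) == limit:
                    break
            i += 1
        return out


suggester = AutoSuggest({
    "Green buildings": "Green buildings",
    "Green architecture": "Green buildings",   # hypothetical synonym
    "Greenhouse gases": "Greenhouse gases",
    "GHG": "Greenhouse gases",                 # hypothetical acronym
    "Recycling": "Recycling",
})
green = suggester.suggest("green")  # ["Green buildings", "Greenhouse gases"]
```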
17. Refinements in our use of the thesaurus and semantic index in sustainability
Auto calculation of sustainability score using LDA topics and thesaurus sustainability facet
Calculate topics and term correlations
Compute sustainability score for each topic based on the most relevant terms and sustainability facet
Compute a sustainability score for each corpus document based on topic weights and topic sustainability score
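The three steps above can be sketched as follows, assuming the topic/term correlations are already available. The scoring formula is an assumption: a topic's score is the share of its top-term correlation mass that falls on terms in the sustainability facet. The function names, the `top_n` cutoff, and the sample data are hypothetical.

```python
def auto_topic_score(term_correlations, facet_terms, top_n=20):
    """Score a topic in [0, 1] from its most relevant terms and the facet.

    term_correlations: {thesaurus term: correlation with the topic}.
    facet_terms:       set of terms tagged with the sustainability facet.
    """
    top = sorted(term_correlations.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    total = sum(corr for _, corr in top)
    if total == 0:
        return 0.0
    in_facet = sum(corr for term, corr in top if term in facet_terms)
    return in_facet / total


def auto_document_score(doc_topic_weights, topic_scores, scale=9):
    """Combine topic weights (assumed to sum to at most 1) and topic scores into 0-9."""
    raw = sum(w * topic_scores.get(t, 0.0) for t, w in doc_topic_weights.items())
    return round(raw * scale)


facet = {"Recycling", "Renewable energy"}
topic_scores = {
    "t1": auto_topic_score({"Recycling": 0.5, "Renewable energy": 0.3, "Taxation": 0.2}, facet),
    "t2": auto_topic_score({"Sonnets": 0.7, "Tragedy": 0.3}, facet),
}
doc_score = auto_document_score({"t1": 0.75, "t2": 0.25}, topic_scores)  # 5
```

This replaces the manual step of staff reviewing each topic's top words; the facet tags stand in for the human judgment.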
Automated LDA topic labeling
Labeling topics generated by unsupervised topic modeling is an ongoing challenge
We’re investigating the feasibility of using the same topic/term correlations used to compute sustainability scores to assign labels
The approach attempts to find the thesaurus term that best characterizes the most highly correlated terms for each topic
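One way the labeling idea could work, offered as a sketch rather than the method under investigation: let each highly correlated term vote for itself and for its immediate broader terms, and pick the candidate with the most support. The `label_topic` function, the one-level roll-up, and the sample terms and hierarchy are all assumptions.

```python
from collections import defaultdict


def label_topic(term_correlations, broader, top_n=10):
    """Pick a thesaurus term to label a topic.

    term_correlations: {thesaurus term: correlation with the topic}.
    broader:           {term: list of its immediate broader terms (BTs)}.
    """
    top = sorted(term_correlations.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    support = defaultdict(float)
    for term, corr in top:
        support[term] += corr                 # the term itself is a candidate
        for parent in broader.get(term, ()):
            support[parent] += corr           # so is each immediate BT
    if not support:
        return None
    # Iterate in sorted order so ties resolve alphabetically.
    return max(sorted(support), key=support.get)


# Hypothetical fragment of a hierarchy.
broader = {
    "Recycling": ["Waste management"],
    "Composting": ["Waste management"],
    "Landfills": ["Waste management"],
    "Taxation": ["Public finance"],
}
label = label_topic(
    {"Recycling": 0.4, "Composting": 0.3, "Landfills": 0.2, "Taxation": 0.1},
    broader,
)  # "Waste management"
```

Rolling up only one level is a deliberate limit: following BTs all the way to the top would let the most general term win every time.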
18. Other JSTOR Labs projects/tools using the thesaurus and semantic index
http://labs.jstor.org/jthes/
http://labs.jstor.org/snap/
http://labs.jstor.org/readings/
Thesaurus Visualization Tool
19. And some other JSTOR Labs projects
http://labs.jstor.org/reflowit/
http://labs.jstor.org/shakespeare/
Third year at DHUG. First year was building our thesaurus; second year was maintenance and training; this year we are excited to present how we are utilizing the thesaurus in our content.
JTHES partners w/Labs (Their job is to get new ideas off the ground. To seek out new concepts and opportunities for JSTOR, & refine and validate them through research and experimentation)
Grey literature=academic literature that is not formally published
Natural resource economics deals with the supply, demand, and allocation of the Earth's natural resources
There is no simple definition of 'sustainability'... Most definitions include: 1. living within the limits of what the environment can provide. 2. understanding the many interconnections between economy, society and the environment. 3. the equal distribution of resources and opportunities. http://www.environment.nsw.gov.au/sustainability
Orig. had 18 TT (top terms), but due to the Sustainability Collection we added Environmental studies.
Terms would be used from multiple JTHES branches. Which terms are always about sustainability (all/some rule)? Only terms always and unambiguously pertaining to sustainability topics should be included [e.g. Carbon vs. Green buildings]. I first pulled a list of what I thought the sustainability terms would be (recycle, green, etc.), which came to 1,200. Then I pulled full branches [using the File-Export function] in multiple areas, including Architecture, Economics, Law, etc., which totaled 18k.
BT-Explain more fully: As a rule we will only have a term living in up to 3 branches (e.g. Sustainable design lives in Sustainability science, Design engineering and Sustainable engineering).
Chose the facet approach: it didn’t do everything we wanted, but for time, cost and execution it was the best choice. The facet is applied in the Admin module. The facet tab appears at the bottom of the term record along with other tabs such as Definition, History, Scope note, etc. In the facet tab we added the word “Sustainability” to each term that was chosen for the collection.
New branches: Development studies, Environmental biology, Environmental social sciences, Environmental studies, Food studies, Sustainable architecture, Sustainable engineering, Urban studies, Wildlife studies
New rules: Build*/Buildings, Conservation, Ecology/Ecologic*, Environment, Environmental, Garden*, Green, Industr*/Industry/Industries, Soils, Sustainab*, Urbaniz*, Wildlife
Ended up with 6 SMEs total (Architecture, Engineering, Bio/Env science, Agriculture/Urban studies, Econ, Law/Policy). Once an SME was secured, an introductory phone conversation was set up where the project was defined and discussed. A brief review of basic taxonomic practices was given. The SME was sent a flat list of thesaurus terms (via Excel spreadsheet) within their discipline, along with a short document on how to approach looking at the terms and suggestions for feedback. Approx. 100-150 terms to review for each SME. Once the spreadsheet was completed, the taxonomists would review the SME's suggestions and incorporate them into the JTHES.
If SME said “no” to “Correctly included”, term was removed from facet. Suggested terms section (at bottom of the spreadsheet for SME to list out anything missing from the list).
Lessons learned: Spreadsheets could have been more user friendly (use drop-down choices for columns); suggested terms at the bottom of the spreadsheet overlapped across many SMEs, so collating feedback before implementation would have been more efficient. May have been helpful to send SMEs a hierarchical view so they could see how terms relate to each other.
Since May, when I estimate we began work on adding terms for this project, 385 terms have been added to the JTHES, 1197 rules were added and over 400 rules were updated; over 1500 terms are tagged in the facet.
Hand off to Ron for live demo portion of presentation.
Shakespeare: Using a primary text as a portal for locating secondary literature, specifically journal content available from JSTOR. Partnership with the Folger Library.
Classroom Readings: Helping teachers select content from JSTOR. Usage profiles/patterns that looked at a single institution and the spike of a document's usage on either side of a 2 week period.
Additional labs projects include Reflowit (for mobile device viewing)