1. JSTOR
Advanced Technology Research
Denver
25th January 2009
John Burns
Clare Llewellyn
2. Today we will introduce a public beta of our Data for
Research service and show you some of the other
services that JSTOR’s advanced technology group
is working on.
Mission: Working with other researchers on large-
scale text and data mining initiatives with an eye
toward beneficial applications for scholars and
students.
3. What is Data Mining?
“Data mining is the process of extracting hidden patterns from data”
Lyman and Varian 2003
“As data sets and the information extracted from them have grown in
size and complexity, direct hands-on data analysis has increasingly
been supplemented and augmented with indirect, automatic data
processing using more complex and sophisticated tools, methods and
models”
Kantardzic 2002
Example:
Using consumer purchasing patterns to predict which products are
bought together (gas and flights)
4. What is Text Mining?
“In text mining the patterns are extracted from natural language text
rather than from structured databases of facts”
Marti Hearst 2003
“Text mining attempts to discover new, previously unknown
information by applying techniques from information retrieval,
natural language processing and data mining”
National Text Mining Center, UK
Example:
Looking at which words co-occur in articles in order to predict
interactions (magnesium and migraines)
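The co-occurrence idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each article is taken to be a plain-text string, the sample texts are made up, and the function name `cooccurrence_counts` is hypothetical rather than part of any JSTOR tool.

```python
from collections import Counter
from itertools import combinations
import re

def cooccurrence_counts(articles):
    """Count, for each unordered word pair, how many articles contain both words."""
    pair_counts = Counter()
    for text in articles:
        # Lowercase, keep alphabetic tokens, and use a set so that
        # each word is counted at most once per article.
        words = set(re.findall(r"[a-z]+", text.lower()))
        # Sorting gives every pair a single canonical (alphabetical) order.
        for pair in combinations(sorted(words), 2):
            pair_counts[pair] += 1
    return pair_counts

# Hypothetical toy corpus, not real article text.
articles = [
    "Magnesium deficiency was observed in migraine patients",
    "Migraine frequency fell after magnesium supplements",
    "Calcium intake and bone density",
]
counts = cooccurrence_counts(articles)
print(counts[("magnesium", "migraine")])  # 2
```

Pairs that appear together in many articles are candidates for a closer look; a real analysis would also remove common words and test whether the co-occurrence is more frequent than chance.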
6. Why are we releasing our system here?
Librarians are the point from which innovation is spread throughout the
academy
“New roles and functions for librarians include:
• information consultants and producers
• information gatekeepers and intermediators
• end-user educators
• managers and leaders
• data analysts in data administration centers
• preservers of knowledge
• information equalizers”
Park 1987
A Data Support Role: “Helping students get their hands dirty with the
data”
Robin Rice 2008
2nd DCC / RIN Research Data Management Forum
7. Who we are - Advanced Technology Research
• A formal commitment by JSTOR to a pro-active role in technology
innovation to face new challenges and opportunities
• Our MO is to collaborate with and aid the scholarly community
• We are a team of world-class scientists and technologists with a proven
track record of innovation
Mission Statement
“The Advanced Technology Research Group is dedicated to creating,
discovering and using relevant technologies in support of JSTOR and the
broader scholarly community.”
8. ATR - Collaborations with the academic community.
For other researchers we provide
• Access to large well-curated data sets
• An exposure channel on JSTOR for research results
• Facilities on JSTOR to expose tools and techniques to users
• Collaboration opportunities
For JSTOR
• We evaluate novel techniques
• We present rapid prototypes to users
• We develop peer relationships with research institutions
• We bring new forms of traffic to the JSTOR data
• We reuse JSTOR data in new and exciting ways
9. What we are doing - Projects and Partners
• University of Washington – Citation Network Analysis
• Princeton University – Topic Analysis
• UIUC – Software Environment for the Advancement of Scholarly
Research (SEASR)
• University of Michigan – Linguistic tools
• Tufts – Classics Studies
• University of Liverpool – OAI-ORE, Text Mining, Data Analysis
• University of Queensland – Annotations
• Los Alamos National Labs – Annotation Management
• DFKI (German Artificial Intelligence Centre) – Document capture
and reconstruction / remastering.
• XRCE (EuroPARC, France) – Scanned Document Analysis
• …
10. Advanced Technology Research - Showcase
Showcase provides a preview of interesting and useful
technologies. It allows our research partners to demonstrate
their tools and gain feedback and it allows JSTOR to assess
candidate technologies before committing them to the product
roadmap.
11. Advanced Technology Research - Showcase
A place to expose JSTOR data and tools and to encourage new
research
• Access to JSTOR datasets
• A facility to expose and use tools created by researchers from
JSTOR and elsewhere
• Explanations of ongoing research
• A forum to facilitate connections between groups working with
JSTOR data
URL: http://showcase.jstor.org
12. Data for Research
• DFR is a set of web tools designed to allow for the visual
exploration of large-scale data sets and the download of word
frequencies in JSTOR articles
• Beta Version launched 01/23/09
• URL: http://dfr.jstor.org
13. Why Word Frequencies
Data requested from JSTOR by users in 2008:
• OCR Data
• Citation Data
• Usage Data
• Word Frequency
14. What can you do with word counts?
Real life requests:
“I would like to request time and word distribution frequencies in
linguistics (specific movement removed). These sorts of
frequencies could potentially allow me to better understand and
delimit the formation of groups, and the underlying impetus
behind these groups as expressed in linguistic form.”
“I would like to create subject headings for material, using word
frequency as a guide to selecting the appropriate terms for the
headings.”
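The second request could be approached roughly as follows. This is a sketch in Python under stated assumptions: the word counts are assumed to be already loaded into a dictionary (the DFR download format is not described here), and the stopword list, the sample counts and the function name `candidate_heading_terms` are invented for illustration.

```python
from collections import Counter

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "is"}

def candidate_heading_terms(word_counts, top_n=3):
    """Return the top_n most frequent terms that are not stopwords."""
    filtered = Counter(
        {word: count for word, count in word_counts.items() if word not in STOPWORDS}
    )
    return [word for word, _ in filtered.most_common(top_n)]

# Made-up counts for a single article, not real JSTOR data.
counts = {
    "the": 120, "of": 95, "and": 80,
    "soil": 40, "wheat": 33, "irrigation": 21, "yield": 12,
}
print(candidate_heading_terms(counts))  # ['soil', 'wheat', 'irrigation']
```

The frequent terms are only a guide: a cataloguer would still map them onto a controlled vocabulary before assigning headings.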
25. 3 Journals from 1957
The Annals of Mathematics, American Journal of Nursing, Agricultural History
26. Any questions / feedback?
Please take a look at the site and tell us what you think.
Email: dfr@jstor.org
Contact details
Email: clare.llewellyn@jstor.org
Phone: 609-986-2282