This document summarizes the development of a project to visualize open access at MIT. It discusses the background of MIT's open access policy, prior efforts to collect open access articles, and the goals of the current project. The project uses log files and other data to build a pipeline that analyzes download statistics and stores them in databases. A web interface and email reports are used to provide usage statistics to authors. The project aims to incentivize further open access deposits and inform evaluation of MIT's open access policy.
1. June 11, 2015, Matthew Bernhardt, Open Repositories 2015
Visualizing Open Access
building a scalable infrastructure to
showcase the reach of MIT research
Background
March 18, 2009 - Open Access Policy adopted
“...The policy is to take effect immediately; it will be reviewed after five years by
the Faculty Policy Committee, with a report presented to the Faculty.”
2009 – 2013
MIT Libraries assemble a collection within DSpace@MIT for Open Access
Articles.
~10,000 articles, ~1.5 million downloads, but…
Author-level information?
Department-level information?
Project
August 2013 - Project begins
“Implement author-level, article-level, and aggregated article download usage
statistics for articles in the Open Access Articles Collection in DSpace@MIT to
incentivize deposits and provide useful assessment information for the MIT
Faculty Open Access Policy.”
Prior Work
MyDASH provided solid model…
• Map
• Timeline
• Summary table
… but couldn’t be directly implemented.
• Repository versus One Collection
• Multiple department affiliations
Project Goals
• Make available download statistics at three levels:
author, article, and aggregate
• Incentivize deposits to collection
• Provide useful information for policy evaluation
• Evaluate new technologies within the Libraries (i.e.
MongoDB)
Not Project Goals
• Integration with altmetrics systems
• COUNTER
Pipeline
Start from Apache server logs
● Filter the qualifying downloads
● Look up the downloaded paper
● Augment with additional information
● Store in MongoDB
● Use SOLR to build summary collection
UI queries summary collection
Pipeline challenges - departments
Department names
● Inconsistent program / department affiliations
o “Media Laboratory”
o “Center for Bits and Atoms” (subgroup within Media Lab)
● Spelling Variations
o “MIT Department of Physics”
o “Massachusetts Institute of Technology, Department of Physics”
o “Dept. of Physics”
o “Physics”
Pipeline challenges - departments
Standardized department names
Whitelist of recognized names
Separate variations for display and linking back
to DSpace@MIT
{
  "_id": ObjectId("5449127895b0c25083f29352"),
  "handle": "http://hdl.handle.net/1721.1/52491",
  "title": "A basal ganglia-forebrain circuit in the songbird biases motor output to avoid vocal errors",
  "country": "USA",
  "authors": [
    { "mitid": "3.1415926537", "name": "Fee, Michale S." },
    { "mitid": "6.02x10^23", "name": "Andalman, Aaron S." }
  ],
  "dlcs": [
    {
      "display": "McGovern Institute for Brain Research at MIT",
      "canonical": "McGovern Institute for Brain Research at MIT"
    },
    {
      "display": "Brain and Cognitive Sciences",
      "canonical": "Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences"
Augmented download record
Email to authors
Dear {name},
Thank you for sharing your scholarly articles through the open repository DSpace@MIT <https://dspace.mit.edu/handle/1721.1/49433/>, in association with the MIT Faculty Open Access Policy <https://libraries.mit.edu/oapolicy>.
Our newly implemented OA Stats Service provides data about the use and reach of our open access collection. Since August 2010, 15,184 articles have been downloaded from 227 different countries.
This service also provides information at the author and article level:
Your {count_articles} articles have been downloaded {count_downloads} times since they were deposited, from {count_countries} different countries.
You can access more detailed download information about your articles, including per-article and per-country downloads at <https://oastats.mit.edu>.
Initially, we plan to provide this information to all authors via email in the Fall and Spring semesters. As we seek to improve the service, we'll consider expanding options to interact with it and the underlying data.
We are anxious to hear your feedback on how this service can be most useful to you, so please send your suggestions to oastats@mit.edu.
--From the MIT Libraries
Faculty reception
Excitement
● “Thank you for the update, this is a fantastic tool!!”
● “Thanks so much for doing this - it's really cool and awesome!”
Why not more?
● “Hi, I like your feedback. But I am puzzled that only one of my articles is in
your database.”
● Department heads using this as leverage to encourage further
contributions
Project goals revisited
• Make available download statistics at three levels:
author, article, and aggregate
• Incentivize deposits to collection
• Provide useful information for policy evaluation
• Evaluate new technologies within the Libraries (i.e.
MongoDB)
Future work
● Automate the pipeline
● Run pipeline more frequently
● Ditch Mongo for something relational
● Talk to faculty about making more detailed information
public
● Add functionality to UI (more export formats, SPA)
● Improve cataloging in DSpace@MIT with lookup
services
Thanks!
Matt Bernhardt
mjbernha@mit.edu
@morphosis7
https://github.com/MITLibraries/oastats-backend
https://github.com/MITLibraries/oastats-ui
https://github.com/MITLibraries/poast
http://oastats.mit.edu
Editor's notes
This presentation describes the effort to build a reporting service for open access article downloads at the MIT Libraries. This project began in the fall of 2013, and launched during Open Access Week in 2014. The web interface can be seen at http://oastats.mit.edu.
The background to this project started in 2009, when the faculty of MIT adopted the Open Access Policy. One component of the policy was the call for a review after five years.
The full policy can be seen at http://libraries.mit.edu/scholarly/mit-open-access/open-access-at-mit/mit-open-access-policy/
In the years after the adoption of the policy, the MIT Libraries created and populated an Open Access collection within the DSpace@MIT repository.
By the summer of 2013, this collection contained approximately 10,000 articles. These articles had been downloaded approximately 1.5 million times.
However, more refined download information was not available. The Libraries could not say how often a given paper was downloaded, nor how often a given author’s papers were downloaded, nor how often the papers of a given department were downloaded.
The Libraries did have server logs going back almost to the founding of the open access collection, however. If these logs could be processed accurately, this sort of information could be uncovered.
In order to provide download counts at these various levels of resolution, the Libraries began a project to build a reporting service.
Harvard Libraries had unveiled MyDASH, which served as an inspiration to our early work. MyDASH can be seen at https://osc.hul.harvard.edu/dash/mydash
The MyDASH service provided maps, timelines, and other summary information about downloads from Harvard’s repository.
Unfortunately, due to differences between our repositories, we were unable simply to implement the MyDASH software at MIT.
Nonetheless, inspired by the MyDASH project, the MIT project included similar output among its goals. We would provide summary information, at varying levels of refinement, for item downloads.
By providing this information, we hoped to incentivize further deposits to the collection, and to provide needed context for the evaluation of the open access policy itself.
The developers’ group also had an internal goal to evaluate MongoDB as a platform.
This project is distinct from other, similar, efforts in the repository space that deal with altmetrics and COUNTER. Our goals were simply to take existing server logs, process them, and make the information contained therein available.
Our project team conceived of a data processing pipeline, extracting relevant server log entries, augmenting with additional information, and storing the resulting records in a database that would be queried by a web interface. This image shows the whiteboard at the end of our first planning meeting. The photograph was taken by Matt Bernhardt.
Our project thus consists of three parts, each with its code posted on GitHub:
A data processing pipeline that ingests Apache server logs, augments with a few external sources of data, and stores the results in a MongoDB collection.
A visualization interface that surfaces the contents of the MongoDB collection using standard libraries such as d3.js
An email notification service that sends summary emails to authors represented in the MongoDB collection.
The processing pipeline was written in Python by Michael Graves, and makes use of a number of technologies including DSpace itself.
The pipeline follows several steps:
Start from Apache logs
Keep only downloads from the OA collection
Filter out bots
Augment with author identities
Augment with geo-referenced IP addresses
Store in raw Mongo collection
Generate summary collection via SOLR
https://github.com/MITLibraries/oastats-backend
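The first filtering steps above can be sketched roughly as follows. This is a minimal illustration and not the code in oastats-backend: it assumes the logs use Apache's "combined" format, that downloads appear as requests under a `/bitstream/handle/<handle>/` path, and that bots can be spotted by substrings in the user agent. The names `qualifying_downloads`, `DOWNLOAD_PATH`, and `BOT_MARKERS` are hypothetical.

```python
import re
from typing import Iterable, Iterator, Optional, Set

# Apache "combined" log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Assumption: a download URL embeds the item handle like this.
DOWNLOAD_PATH = re.compile(r'^/bitstream/handle/(?P<handle>\d+(?:\.\d+)*/\d+)/')

# Assumption: a crude user-agent blocklist; a real deployment would need more.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def parse_line(line: str) -> Optional[dict]:
    """Return the named fields of one log line, or None if it does not parse."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def qualifying_downloads(lines: Iterable[str], oa_handles: Set[str]) -> Iterator[dict]:
    """Yield successful, non-bot downloads of items in the OA collection."""
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if entry["method"] != "GET" or entry["status"] != "200":
            continue
        path_match = DOWNLOAD_PATH.match(entry["path"])
        if path_match is None:
            continue
        handle = path_match.group("handle")
        if handle not in oa_handles:
            continue
        agent = entry["agent"].lower()
        if any(marker in agent for marker in BOT_MARKERS):
            continue
        yield {"ip": entry["ip"], "time": entry["time"], "handle": handle}
```

Writing the filter as a generator means large log files can be streamed line by line, with the augmentation steps consuming records as they are produced.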
This diagram shows the pipeline in a more visual layout. One of the keys to the project was the ability to locate relevant data within DSpace@MIT.
Getting useful information out of the pipeline required us to address several data quality issues. Two of the biggest were the treatment of author names and department names within the collection.
Author names were cataloged in the collection as uncontrolled strings, without an identity behind them. Because of variations in how author names are provided by various journals, the authors themselves, and other sources, the same person frequently appeared under different names.
A second, converse problem was the ambiguity created when multiple people were referred to by similar (or identical) strings. One particularly unusual case was that of a father and son with nearly identical names, who were briefly affiliated with the same partner and, in some cases, the same papers.
Our solution was to attach a hidden JSON bitstream to each article in the collection, which referenced each author name back to an MIT ID value.
This slide demonstrates a sample structure of an identity bitstream. The MIT ID values here have been replaced with nonsense values.
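As a rough illustration of how such a bitstream could be consumed, the sketch below assumes the JSON is a list of name and MIT ID pairs, shaped like the `authors` array in the augmented download record. The actual bitstream layout is not reproduced in this transcript, so the structure and the function names `load_identities` and `attach_identities` are hypothetical.

```python
import json
from typing import Dict, List, Optional


def load_identities(bitstream_text: str) -> Dict[str, str]:
    """Map each cataloged author string to its MIT ID (assumed layout)."""
    return {entry["name"]: entry["mitid"] for entry in json.loads(bitstream_text)}


def attach_identities(
    author_names: List[str], identities: Dict[str, str]
) -> List[Dict[str, Optional[str]]]:
    """Pair each author string with an ID, or None when no identity is recorded."""
    return [{"name": name, "mitid": identities.get(name)} for name in author_names]
```

Because lookups are keyed on the exact string cataloged for that one article, two people with identical names on different papers can still resolve to different IDs.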
A second challenge to the pipeline was the treatment of department names.
One aspect of this challenge was variation in cataloging practices. For example, some papers would be referenced only to the Media Laboratory, while others would reference a subgroup such as the Center for Bits and Atoms. Our project team had to determine the level of organizational resolution we could deliver.
Another aspect was the variation in department names across papers – for example, the varying ways in which a paper might refer to the Department of Physics.
Our response to these challenges included several steps. First, we decided on a standardized list of department names, and made sure that the records were catalogued accordingly.
Second, we created a whitelist of recognized department names, and made sure that all papers contained at least one of these names. Thus, a record that originally referenced only the Center for Bits and Atoms would be edited to include a reference to the Media Lab (its parent organization). The original reference to the Center for Bits and Atoms could be preserved, but not used in the processing pipeline.
Finally, we stored the download record using a pairing of a shorter display name for a department, and a longer canonical name that matched how each department was referred to in DSpace@MIT.
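The whitelist and display/canonical pairing could look something like the sketch below. It is an illustration under assumptions, not the project's code: the Brain and Cognitive Sciences and McGovern entries come from the sample record, while the canonical Physics name and the variant table are guesses modeled on that pattern.

```python
from typing import Dict, Optional

# Canonical name (as used in DSpace@MIT) -> shorter display name.
WHITELIST: Dict[str, str] = {
    "Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences":
        "Brain and Cognitive Sciences",
    "McGovern Institute for Brain Research at MIT":
        "McGovern Institute for Brain Research at MIT",
    # Assumed canonical form, following the pattern above.
    "Massachusetts Institute of Technology. Department of Physics": "Physics",
}

_PHYSICS = "Massachusetts Institute of Technology. Department of Physics"

# Known spelling variations, lowercased, mapped to a canonical name.
VARIANTS: Dict[str, str] = {
    "mit department of physics": _PHYSICS,
    "massachusetts institute of technology, department of physics": _PHYSICS,
    "dept. of physics": _PHYSICS,
    "physics": _PHYSICS,
}


def normalize_department(raw: str) -> Optional[Dict[str, str]]:
    """Return a display/canonical pair, or None if the name is not recognized."""
    cleaned = " ".join(raw.split())  # collapse stray whitespace
    canonical = cleaned if cleaned in WHITELIST else VARIANTS.get(cleaned.lower())
    if canonical is None:
        return None
    return {"display": WHITELIST[canonical], "canonical": canonical}
```

Returning None for unrecognized names, such as a subgroup that is not on the whitelist, makes it easy to flag records whose cataloging still needs attention.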
This example download record shows how a single record was augmented with author identities, department name objects, and a georeferenced country code locating the requesting IP address on a map.
After generating this collection of download records, we realized that a summary collection would be needed. MongoDB was not performant enough for us to build a visualization platform off the augmented collection of download events itself. This slide shows what a sample summary document contains.
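To make the idea of a summary collection concrete, here is a small pure-Python stand-in that rolls individual download records up into one summary per author. The project itself built its summaries with Solr, so this is only an assumed illustration of the aggregation. The field names follow the augmented download record, and `summarize_by_author` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_by_author(downloads: Iterable[dict]) -> Dict[str, dict]:
    """Aggregate download events into per-author totals keyed by MIT ID."""
    summary: Dict[str, dict] = {}
    for record in downloads:
        for author in record["authors"]:
            entry = summary.setdefault(
                author["mitid"],
                {
                    "name": author["name"],
                    "downloads": 0,
                    "countries": Counter(),  # downloads per country code
                    "handles": set(),        # distinct articles downloaded
                },
            )
            entry["downloads"] += 1
            entry["countries"][record["country"]] += 1
            entry["handles"].add(record["handle"])
    return summary
```

Precomputing totals like these lets the interface read one small document per author or department instead of scanning every download event on each page load.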
The web interface that was built atop this MongoDB collection can be seen at http://oastats.mit.edu
The web interface was written in PHP, and included the D3.js visualization framework as well as some additional helper libraries. The source code is available on GitHub.
The interface provides two parallel visualization paths – one for the general public, and one for authors. The public is provided information about the downloads from departments, labs, and centers across MIT. Individual authors are also provided information about the downloads of their individual papers. Both paths use the same codebase for maintainability.
The project team went through several iterations of prototypes, including sketches and whiteboard drawings, before settling on an interface that closely matches that of DSpace@MIT.
The final interface provides three types of summary information. First is a table showing total download counts for the requested level of resolution (here, a table of department-level data). Authors would see a table showing each paper.
The second option is a timeline, showing the cumulative total downloads as of each point in time.
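A cumulative series of this kind can be derived from per-event dates with a running sum. The sketch below is an assumed illustration (the interface itself was written in PHP with D3.js), and `cumulative_timeline` is a hypothetical name.

```python
from collections import Counter
from datetime import date
from itertools import accumulate
from typing import Iterable, List, Tuple


def cumulative_timeline(download_dates: Iterable[date]) -> List[Tuple[date, int]]:
    """Return (day, total downloads up to and including that day), in date order."""
    per_day = Counter(download_dates)
    days = sorted(per_day)
    running_totals = accumulate(per_day[day] for day in days)
    return list(zip(days, running_totals))
```

Days with no downloads are simply absent from the result, so a charting layer would hold the previous total flat across those gaps.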
The third option is a map, indicating how many downloads have been requested from each country. The map is rendered as a five-category choropleth.
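One simple way to assign countries to five categories is to compare each count against four fixed break points. The notes do not say how the categories were actually chosen, so the break values below are purely hypothetical, and `choropleth_class` is an illustrative helper.

```python
from bisect import bisect_right
from typing import Dict, Sequence

# Hypothetical break points: four breaks define five categories (0 through 4).
BREAKS: Sequence[int] = (10, 100, 1_000, 10_000)


def choropleth_class(count: int, breaks: Sequence[int] = BREAKS) -> int:
    """Return the category index for a count; a count equal to a break falls in the upper category."""
    return bisect_right(breaks, count)


def classify_countries(counts: Dict[str, int]) -> Dict[str, int]:
    """Map each country code to its category index."""
    return {country: choropleth_class(total) for country, total in counts.items()}
```

Fixed breaks keep the legend stable between runs, and countries with equal counts always land in the same category.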
The final component of this project was to build an email reporting system.
Authors whose papers have been downloaded more than a threshold number of times (currently 20) receive a simple text email inviting them to view their download data via the web interface. The email also includes basic download data about the open access collection as a whole.
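The threshold check and template fill can be sketched as below. The placeholders `{name}`, `{count_articles}`, `{count_downloads}`, and `{count_countries}` come from the email text on the slide. Everything else, including the `email` field and the `build_reports` helper, is an assumption for illustration rather than the code in the project's repository.

```python
from typing import Dict, Iterable, Iterator, Tuple

DOWNLOAD_THRESHOLD = 20  # authors must exceed this to receive a report

TEMPLATE = (
    "Dear {name},\n\n"
    "Your {count_articles} articles have been downloaded {count_downloads} times "
    "since they were deposited, from {count_countries} different countries.\n"
)


def build_reports(
    author_summaries: Iterable[Dict[str, object]],
    threshold: int = DOWNLOAD_THRESHOLD,
) -> Iterator[Tuple[str, str]]:
    """Yield (email address, message body) for authors above the threshold."""
    for summary in author_summaries:
        if summary["count_downloads"] > threshold:
            # str.format ignores keys the template does not use, such as "email".
            yield summary["email"], TEMPLATE.format(**summary)
```

Keeping message construction separate from sending makes it possible to review the generated messages, and the address list, before anything goes out.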
The first time this email service was used, the visualization interface experienced a significant surge in traffic.
There are also some kinks to be worked out about what email addresses we use. This slide depicts a set of 4,000+ bounce messages that resulted from using an inaccurate list of active email addresses.
The feedback we’ve received from faculty and administrators about this project has been almost entirely positive.
The most exciting part of the feedback has been from authors wanting to know why more of their papers are not represented in the collection. At least one department head has used this system as leverage to encourage further contributions.
Returning to our set of project goals, almost all of them were met. The final service contains the types of visualizations we envisioned, and the feedback we’ve received indicates that authors have been incentivized to submit additional materials to the collection. The developers also now have more experience with MongoDB.
The one goal of which we are unsure is the hope that this service is useful for the faculty as they review the open access policy. That work is still ongoing.
Our goals for the future of this project are several-fold. There are efficiencies which need to be introduced in the pipeline and cataloging interfaces, and new features to be added to the web interface.
We have also decided to move off of MongoDB to a relational database.
We also hope to approach the faculty about providing more detailed information to the public, rather than just department-level summaries.
Thanks!
The source code for all three components of this project has been posted to GitHub at the linked addresses on the slide:
https://github.com/MITLibraries/oastats-backend
https://github.com/MITLibraries/oastats-ui
https://github.com/MITLibraries/poast
The reporting interface itself can be seen at:
http://oastats.mit.edu