Zohreh Zahedi, Stefanie Haustein & Tim Bowman (2014). Exploring data quality and retrieval strategies for Mendeley reader counts. Presentation at SIGMET Metrics 2014 workshop, 5 November 2014, Seattle, WA (USA)
1. Metrics14 - ASIS&T SIGMET Workshop, Seattle, 5th November, 2014
Exploring data quality and retrieval
strategies for Mendeley reader counts
Zohreh Zahedi¹, Stefanie Haustein² & Timothy D. Bowman²
z.zahedi.2@cwts.leidenuniv.nl stefanie.haustein@umontreal.ca tim.bowman@gmail.com
@zohrehzahedi @stefhaustein @timothydbowman
¹Leiden University, The Netherlands
²Université de Montréal, Canada
2. • online reference management tool
• usage statistics, available via open API
3. • 2.8 million users, 275,860 groups,
535 million user documents (02/2014)
• 68 million unique publications (08/2012;
281 million user documents)
Mendeley statistics based on monthly user counts from 10/2010 to 02/2014 on the Mendeley website accessed through the Internet Archive
5. Research Objectives
• fluctuation in Mendeley coverage and readership
counts over time and through different retrieval
strategies (Bar-Ilan, 2014)
• altmetric studies and tools use different retrieval
strategies
• DOI API search
• title search (e.g., Webometric Analyst)
lack of systematic study to determine effect of
retrieval strategy
6. Research Objectives
• analyzing metadata quality of Mendeley entries
systematically
• testing completeness and accuracy of relevant
metadata fields
• identify and quantify error types
• analyze difference between retrieval strategies
determine best retrieval strategy for collecting
Mendeley reader counts
7. Research Questions
• How accurate is the metadata on Mendeley for a
random sample of publications?
• To what extent do results differ between:
• manual title search in online catalog
• API search via DOI
• What are the most frequent error types in the
bibliographic data on Mendeley?
• What retrieval strategy provides the most
accurate and complete results for the sampled
publications?
8. Data set and Method
• random sample of 2012 WoS publications:
384 of 1,873,759 documents
• manual title search via Mendeley online catalog
n=384
• DOI search via Mendeley API simultaneously
n=264 (-31%)
• comparison of all relevant metadata
• Author
• DOI
• ISSN
• Issue
• Pages
• Source
• Title
• Volume
• Year
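The field-by-field comparison listed above can be sketched in a few lines. This is only an illustration, not the authors' procedure: the field names in `FIELDS`, the helper names `compare_field` and `compare_record`, the case-insensitive exact match, and the example records are all assumptions.

```python
from typing import Dict, Optional

# Hypothetical field names, following the metadata fields on the slide.
FIELDS = ["author", "doi", "issn", "issue", "pages",
          "source", "title", "volume", "year"]


def compare_field(reference: Optional[str], retrieved: Optional[str]) -> str:
    """Label one retrieved field as 'correct', 'incorrect', or 'missing'."""
    if retrieved is None or retrieved.strip() == "":
        return "missing"
    # Assumption: a case-insensitive exact match counts as correct.
    same = (reference or "").strip().lower() == retrieved.strip().lower()
    return "correct" if same else "incorrect"


def compare_record(reference: Dict[str, str],
                   retrieved: Dict[str, str]) -> Dict[str, str]:
    """Compare every field of a retrieved record against the reference record."""
    return {f: compare_field(reference.get(f), retrieved.get(f)) for f in FIELDS}


# Made-up example records (not taken from the study).
wos = {"author": "Smith, J", "doi": "10.1000/example",
       "title": "Déjà vu in networks", "year": "2012"}
mendeley = {"author": "Smith, J", "doi": "10.1000/EXAMPLE",
            "title": "Deja vu in networks", "year": "2012"}

result = compare_record(wos, mendeley)
print(result)
```

In this made-up pair the title is flagged as incorrect only because of the accented characters, which is the kind of error the conclusions attribute to special characters.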
11. Results: overview
• API DOI search (n=264): 91.3% of searched documents found, 2 false positives
• manual title search (n=384): 47.4% of searched documents found
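The two percentages can be rechecked from the counts reported elsewhere in the deck (241 documents found via DOI, 182 via title). A minimal arithmetic sketch, with variable names chosen here for illustration:

```python
# Counts taken from the slides: documents searched and documents found.
searched = {"API DOI search": 264, "manual title search": 384}
found = {"API DOI search": 241, "manual title search": 182}

# Share of searched documents that each strategy retrieved, in percent.
hit_rates = {k: round(100 * found[k] / searched[k], 1) for k in searched}

# Share of the 384 sampled documents that have a DOI at all, in percent.
doi_share = round(100 * searched["API DOI search"] / searched["manual title search"])

print(hit_rates, doi_share)
```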
12. Results: overview
                                  documents        reader counts
                                  N      %         N       %       +
identical reader counts           103    36.4      975     41.1    0
  identical                       102    36.0      975     41.1    0
  identical, both 0               1      0.4       0       0       0
API higher                        111    39.2      752     31.7    718
  API higher                      10     3.5       204     8.6     170
  API higher, manual not found    80     28.3      548     23.1    548
  API 0, manual not found         21     7.4       0       0       0
manual higher                     69     24.4      644     27.2    563
  manual higher                   21     7.4       379     16.0    298
  manual higher, API not found    40     14.1      242     10.2    242
  manual higher, API 0            6      2.1       23      1.0     23
  manual 0, API not found         2      0.7       0       0       0
all documents                     283    100.0     2,371   100.0   1,281
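The categories in this table can be reproduced from pairs of reader counts. The sketch below is an assumption about how such a classification might be coded: `classify` is a hypothetical helper, `None` stands for "not found", and the pairs are made-up toy data rather than the study's records.

```python
from collections import Counter
from typing import Optional


def classify(api: Optional[int], manual: Optional[int]) -> str:
    """Assign a (API count, manual count) pair to one table category.

    None means the document was not found by that strategy.
    """
    if api is None and manual is None:
        return "not found"
    if api is None:
        return "manual 0, API not found" if manual == 0 else "manual higher, API not found"
    if manual is None:
        return "API 0, manual not found" if api == 0 else "API higher, manual not found"
    if api == manual:
        return "identical, both 0" if api == 0 else "identical"
    if api > manual:
        return "API higher"
    return "manual higher, API 0" if api == 0 else "manual higher"


# Toy (API, manual) reader-count pairs, invented for illustration.
pairs = [(12, 12), (0, 0), (9, 4), (3, None), (0, None),
         (2, 7), (None, 5), (0, 4), (None, 0)]

categories = Counter(classify(a, m) for a, m in pairs)

# Combined reader count when the higher value per document is kept.
combined_max = sum(max(a or 0, m or 0) for a, m in pairs)

print(categories)
print(combined_max)
```

Keeping the higher value per document corresponds to the "max" way of combining the two strategies.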
13. Results: incorrect metadata
          Title search (n=182)      DOI search (n=241)
Field     correct    incorrect      correct    incorrect
Author    93%        7%             94%        6%
DOI       92%        4%             100%*      0%*
ISSN      87%        13%            32%        68%
Issue     90%        6%             83%        10%
Pages     80%        14%            83%        10%
Source    73%        27%            76%        24%
Title     85%        15%            82%        18%
Volume    94%        6%             91%        7%
Year      99%        1%             99%        1%
*the API DOI search retrieved two false positives which are not included in this analysis
15. Conclusions
• errors in fields commonly used for matching (title search / DOI search):
• Title: 15/18%
• First author: 7/6%
• Year: 1/1%
• source (27/24%), ISSN (13/68%), volume (6/7%),
issue (6/10%), page number (14/10%) should not
be used for matching
• special characters produce most errors, removing
them would resolve large share of errors:
• Title: 81/84%
• First author: 67/73%
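Removing special characters before matching could look roughly like the following. This is a sketch under assumptions, not the procedure used in the study: `normalize_title` is a hypothetical helper that strips diacritics, lowercases, and collapses punctuation and whitespace into single spaces.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Reduce a title to lowercase letters and digits separated by single spaces."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace every run of punctuation, symbols, or whitespace with one space.
    cleaned = re.sub(r"[\W_]+", " ", no_marks.lower())
    return cleaned.strip()


# Two made-up spellings of the same title that differ only in special characters.
a = normalize_title("Déjà vu: the “small-world” effect")
b = normalize_title("Deja vu - The small world effect")
print(a)
print(a == b)
```

The same normalization could be applied to first-author names, the other matching field where special characters account for most errors.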
16. Conclusions
• results of retrieval strategies:
• manual title: 182 (64%) documents & 1,653 readers
• API DOI: 241 (85%) & 1,808
• combined: 283 & 2,371 (max) / 2,486 (sum)
• DOI search found 101 (36%) additional documents,
but:
• could not be applied to 120 (31%) documents w/out DOI
• did not retrieve 42 (15%) documents found by title
search
• led to 2 (1%) false positives
combination of DOI and title search w/out special
characters
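One way to picture the combined strategy is to query by DOI where one exists, query by title as well, and keep the higher reader count. The sketch below assumes this "max" rule and uses hypothetical lookup functions; `by_doi` and `by_title` stand in for real retrieval calls and are backed here by made-up dictionaries, not by the Mendeley API.

```python
from typing import Callable, Dict, Optional

# A lookup returns a reader count, or None if the document is not found.
Lookup = Callable[[str], Optional[int]]


def combined_reader_count(doi: Optional[str], title: str,
                          by_doi: Lookup, by_title: Lookup) -> Optional[int]:
    """Return the higher reader count of DOI and title search, or None if neither hits."""
    counts = []
    if doi:  # DOI search cannot be applied to documents without a DOI
        doi_count = by_doi(doi)
        if doi_count is not None:
            counts.append(doi_count)
    title_count = by_title(title)
    if title_count is not None:
        counts.append(title_count)
    return max(counts) if counts else None


# Made-up indexes standing in for the two retrieval strategies.
doi_index: Dict[str, int] = {"10.1000/a": 10, "10.1000/b": 0}
title_index: Dict[str, int] = {"paper a": 7, "paper c": 5}

both = combined_reader_count("10.1000/a", "paper a", doi_index.get, title_index.get)
doi_only = combined_reader_count("10.1000/b", "paper b", doi_index.get, title_index.get)
title_only = combined_reader_count(None, "paper c", doi_index.get, title_index.get)
neither = combined_reader_count(None, "paper d", doi_index.get, title_index.get)
print(both, doi_only, title_only, neither)
```

Titles passed to `by_title` would be stripped of special characters first, in line with the recommendation on the slide.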