Presentation given by Joseph Greene, Research Repository Librarian at University College Dublin Library, at Open Repositories 2016, held at Trinity College Dublin, June 13-16th, 2016.
1. UCD Library
University College Dublin,
Belfield, Dublin 4, Ireland
Joseph Greene
Research Repository Librarian
University College Dublin
joseph.greene@ucd.ie
http://researchrepository.ucd.ie
How accurate are IR usage statistics?
Open Repositories 2016
Dublin, 16 June 2016
2. Usage statistics are important for OA repositories
• How is the service used overall?
• Advocacy
– Connects with authors on what is most important to them: the use of their research
• KPI for return on investment
– Usage of a Library service
– Visibility of the university's research
6. How accurate are they? Web robots
• Some follow rules
– Search engines, Internet Archive, link checkers, Twitterbot, etc.
– robots.txt, naming themselves in the user agent string (see the sketch below)
• Others do not
– Email spammers, comment spammers, dictionary attackers, phishers, etc.
– Often mimic human users
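A minimal sketch of what "following the rules" means in practice, using Python's standard urllib.robotparser; the robots.txt content and the bot name ExampleBot are illustrative, not from any real site:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt; a real repository's file will differ.
robots_txt = """\
User-agent: *
Disallow: /statistics
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved robot names itself and checks permission before each request...
print(rp.can_fetch("ExampleBot/1.0", "/statistics"))      # False: disallowed
print(rp.can_fetch("ExampleBot/1.0", "/handle/10197/1"))  # True: allowed

# ...whereas the rogue robots described above skip this check entirely
# and often send a browser-like user agent string to mimic humans.
```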
7. Experimental study
• Simple random sample of 2 years of UCD repository's download data
– n=341, N=3.3 million; 96.20% certainty
• Manually checked to determine if robot or human
• Compared findings against our robot detection technique
– U. Minho DSpace Stats Add-on
– Monthly outlier exclusion (manual)
Greene, J. 2016. Web robot detection in scholarly Open Access institutional repositories. Library Hi Tech, July 2016.
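The talk does not show the sampling code; a minimal sketch of one way to draw a uniform simple random sample of n=341 events from millions of download records streamed one per line (reservoir sampling, so the full dataset never has to fit in memory):

```python
import random

def reservoir_sample(stream, n, seed=42):
    """Uniform simple random sample of n items from a stream of unknown size."""
    random.seed(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < n:
            sample.append(item)
        else:
            j = random.randint(0, i)  # item i+1 is kept with probability n/(i+1)
            if j < n:
                sample[j] = item
    return sample

# Hypothetical usage, assuming one download record per line:
# with open("downloads.log") as f:
#     events = reservoir_sample(f, 341)
```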
8. First finding
85% of the Research Repository UCD's unfiltered downloads come from robots
• Confirmed by a 2013 IRUS-UK white paper covering 20 IRs, which also found 85% robot traffic
9. [Chart: Catching more robots improves stats (but how much depends on the number of robots). X-axis: recall (robots); Y-axis: accuracy of download stats (inverse precision). Curves plotted for: typical website, 15% robot traffic; OA journal, 40% robot; OA repositories, 85% robot; Internet Archive, 91% robot.]
10. How did we do at UCD?
• What proportion of robot downloads did we catch? (Recall)
– Our method catches 94% of all robots
• How often were we correct, i.e. how many downloads that we flag really are robots? (Precision)
– 98.9% of downloads that we label robots really are robots
• How accurate are the download stats, i.e. how many reported downloads are actually made by human beings? (Inverse precision)
– 73% of the download statistics as reported are human
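A minimal sketch of how these three rates fall out of a manually labelled sample, assuming each event carries the manual label (is_robot) and the filter's decision (flagged); the function name and data shape are illustrative:

```python
def detection_metrics(events):
    """events: iterable of (is_robot, flagged_as_robot) booleans, one per download."""
    tp = fp = fn = tn = 0
    for is_robot, flagged in events:
        if flagged and is_robot:
            tp += 1  # robot correctly filtered out
        elif flagged and not is_robot:
            fp += 1  # human wrongly filtered out
        elif not flagged and is_robot:
            fn += 1  # robot that slips into the reported stats
        else:
            tn += 1  # human correctly kept
    recall = tp / (tp + fn)             # share of robots we catch
    precision = tp / (tp + fp)          # share of flagged downloads that are robots
    inverse_precision = tn / (tn + fn)  # share of *reported* downloads that are human
    return recall, precision, inverse_precision
```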
11. How does that compare?
• Who knows? There are no other studies like this on repositories!
• Applied DSpace's and EPrints' web robot detection algorithms to our data
– Experimental
– Real data
– Same dataset used for each 'system'
– Algorithms easy to mimic in vitro (sketched below)
– But SEO, crawl behaviour may be different for different systems
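The paper's "in vitro" comparison is not published as code; a minimal sketch of the idea, replaying one labelled dataset through interchangeable detector functions and scoring each with detection_metrics() from the previous sketch (the detectors passed in would be reimplementations, not the real DSpace/EPrints code):

```python
def compare_systems(events, detectors):
    """events: list of (is_robot, request) pairs from the labelled sample.
    detectors: {system_name: fn(request) -> bool} mimicking each system's algorithm."""
    results = {}
    for name, detect in detectors.items():
        labelled = [(is_robot, detect(request)) for is_robot, request in events]
        results[name] = detection_metrics(labelled)  # (recall, precision, inv. precision)
    return results
```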
12. Robot detection techniques used

Technique                             DSpace   EPrints   Minho DSpace Statistics Add-on
Rate of requests                                         ✓³
User agent string                     ✓        ✓         ✓
robots.txt access                              ✓
Volume of requests                             ✓²        ✓³
List of known robot IP addresses      ✓                  ✓
Reverse DNS name lookup               ✓¹
Trap file                                      ✓
User agents per IP address
Width of traversal in the URL space                      ✓³

¹ Only implemented nominally or experimentally
² Via the repeat download or 'double-click' filter
³ Data available as a configurable report for manual decision making
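User agent string matching is the one technique all three systems share. A minimal sketch of the approach; the pattern list below is illustrative only, not the actual list shipped by DSpace, EPrints or the Minho add-on (production lists run to hundreds of entries and need regular updates):

```python
import re

# Illustrative robot patterns; real lists (e.g. COUNTER's) are far longer.
ROBOT_PATTERNS = re.compile(
    r"bot|crawl|spider|slurp|archiver|curl|wget|python-requests",
    re.IGNORECASE,
)

def is_robot_user_agent(user_agent: str) -> bool:
    """True if the User-Agent string matches a known robot pattern.
    Catches only robots that name themselves; mimics slip through."""
    return bool(ROBOT_PATTERNS.search(user_agent or ""))

print(is_robot_user_agent("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(is_robot_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))        # False
```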
Download and other usage statistics in an item view
In addition, data is provided to Schools for quality reviews and accreditation
We have been aware of web robots since 2009, using the U. Minho add-on plus visually checking for outliers once a month.
Hit 1 million downloads in 2015 and decided we must know more (how to properly identify robots, how accurate our statistics are); we want to have confidence in the information that we produce.
Experiment: simple random sample of 2 years of download data (n=341, N=3.3 million, for 96.20% certainty), manually checked to determine if robot or human. DSpace 1.8.2 with U. Minho DSpace Statistics Add-on v. 4. Apache Tomcat behind Apache HTTP Server; logs in Apache Combined Log Format. The Minho add-on registers every download in the PostgreSQL database. Results to be published in the July 2016 issue of Library Hi Tech (Greene 2016).
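The notes mention logs in Apache Combined Log Format; a minimal sketch of pulling out the fields a robot check needs (client IP and User-Agent) from such a log line. The regex covers the standard format only, and the sample line is invented:

```python
import re

# Apache Combined Log Format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('66.249.66.1 - - [16/Jun/2016:10:00:00 +0100] '
        '"GET /handle/10197/1 HTTP/1.1" 200 1234 "-" '
        '"Googlebot/2.1 (+http://www.google.com/bot.html)"')

m = COMBINED_LOG.match(line)
if m:
    print(m.group("ip"), m.group("user_agent"))
```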
See:
INFORMATION POWER LTD. 2013. IRUS download data: identifying unusual usage [Online]. Available: http://www.irus.mimas.ac.uk/news/IRUS_download_data_Final_report.pdf [Accessed 2015-12-11]. (Confirms the 85% figure.)
DORAN, D. & GOKHALE, S. S. 2011. Web robot detection techniques: overview and limitations. Data Mining and Knowledge Discovery, 22, 183-210. (Hypothesizes why the robot share is so high in OA, p. 191.)
Sources for the robot-traffic ratios in the chart:
– Typical website (15% robot traffic): precision = 0.8727, mean of four studies; robots:total sessions = 0.1516, mean of four studies
– OA journal (40% robot): HUNTINGTON, P., NICHOLAS, D. & JAMALI, H. R. 2008. Web robot detection in the scholarly information environment. Journal of Information Science, 34, 726-741.
– OA repositories (85% robot): Greene 2016 and Information Power 2013 (see above)
– Internet Archive (91% robot): ALNOAMANY, Y., WEIGLE, M. C. & NELSON, M. L. 2013. Access patterns for robots and humans in web archives. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 339-348.
The reverse is also true: if we fail to catch robots (e.g. as detection deteriorates over time while robots improve their capabilities), the accuracy of the stats diminishes.
Formula (Greene 2016):
P_inv = [TR(R − PR − 1) + 2TPR − P(T + R − 1)] / [R(TR − P − T) + P]
where
R = recall (robot detection)
P = precision (robot detection)
P_inv = inverse precision (human stats)
T = ratio of robots to total
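A quick numeric check of the formula; a minimal sketch that plugs in UCD's measured values from slide 10 (recall 0.94, precision 0.989, robot ratio 0.85) and reproduces the reported 73% human figure:

```python
def inverse_precision(R, P, T):
    """Share of reported (post-filter) downloads made by humans (Greene 2016).
    R = recall, P = precision of robot detection, T = robot share of all traffic."""
    num = T * R * (R - P * R - 1) + 2 * T * P * R - P * (T + R - 1)
    den = R * (T * R - P - T) + P
    return num / den

# UCD's measured values reproduce the reported ~73% human statistics:
print(round(inverse_precision(R=0.94, P=0.989, T=0.85), 2))  # 0.73

# With no robot detection at all (R=0) the formula reduces to 1 - T,
# matching slide 8: 85% robots means only 15% of unfiltered stats are human.
print(round(inverse_precision(R=0.0, P=0.989, T=0.85), 2))   # 0.15
```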