Presentation given by Joseph Greene, Research Repository Librarian at University College Dublin Library, at Open Repositories 2016, held at Trinity College Dublin, June 13-16th, 2016.
1. UCD Library
University College Dublin,
Belfield, Dublin 4, Ireland
Joseph Greene
Research Repository Librarian
University College Dublin
joseph.greene@ucd.ie
http://researchrepository.ucd.ie
How accurate are IR usage statistics?
Open Repositories 2016
Dublin, 16 June 2016
2. Usage statistics are important for OA repositories
• How is the service used overall?
• Advocacy
– Connects with authors on what is most important to them: the use of their research
• KPI for return on investment
– Usage of a Library service
– Visibility of the university's research
6. How accurate are they? Web robots
• Some follow rules
– Search engines, Internet Archive, link checkers, Twitterbot, etc.
– robots.txt, naming themselves in the user agent string (see the sketch below)
• Others do not
– Email spammers, comment spammers, dictionary attackers, phishers, etc.
– Often mimic human users
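A minimal sketch of what "following the rules" means in practice, using Python's standard urllib.robotparser; the robots.txt content and the bot name ExampleBot are illustrative, not from any real site:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt; a real repository's file will differ.
robots_txt = """\
User-agent: *
Disallow: /statistics
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved robot names itself and checks permission before each request...
print(rp.can_fetch("ExampleBot/1.0", "/statistics"))      # False: disallowed
print(rp.can_fetch("ExampleBot/1.0", "/handle/10197/1"))  # True: allowed

# ...whereas the rogue robots described above skip this check entirely
# and often send a browser-like user agent string to mimic humans.
```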
7. Experimental study
• Simple random sample of 2 years of UCD repository's download data
– n=341, N=3.3 million; 96.20% certainty
• Manually checked to determine if robot or human
• Compared findings against our robot detection technique
– U. Minho DSpace Stats Add-on
– Monthly outlier exclusion (manual)
Greene, J. 2016. Web robot detection in scholarly Open Access institutional repositories. Library Hi Tech, July 2016.
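The talk does not show the sampling code; a minimal sketch of one way to draw a uniform simple random sample of n=341 events from millions of download records streamed one per line (reservoir sampling, so the full dataset never has to fit in memory):

```python
import random

def reservoir_sample(stream, n, seed=42):
    """Uniform simple random sample of n items from a stream of unknown size."""
    random.seed(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < n:
            sample.append(item)
        else:
            j = random.randint(0, i)  # item i+1 is kept with probability n/(i+1)
            if j < n:
                sample[j] = item
    return sample

# Hypothetical usage, assuming one download record per line:
# with open("downloads.log") as f:
#     events = reservoir_sample(f, 341)
```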
8. First finding
85% of the Research Repository UCD's unfiltered downloads come from robots
• Confirmed by a 2013 IRUS-UK white paper covering 20 IRs, which also found 85% robot traffic
9. [Chart: Catching more robots improves stats (but how much depends on the number of robots). X-axis: recall (robots); Y-axis: accuracy of download stats (inverse precision). Curves plotted for: typical website, 15% robot traffic; OA journal, 40% robot; OA repositories, 85% robot; Internet Archive, 91% robot.]
10. How did we do at UCD?
• What proportion of robot downloads did we catch? (Recall)
– Our method catches 94% of all robots
• How often were we correct, i.e. how many downloads that we flag really are robots? (Precision)
– 98.9% of downloads that we label robots really are robots
• How accurate are the download stats, i.e. how many reported downloads are actually made by human beings? (Inverse precision)
– 73% of the download statistics as reported are human
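A minimal sketch of how these three rates fall out of a manually labelled sample, assuming each event carries the manual label (is_robot) and the filter's decision (flagged); the function name and data shape are illustrative:

```python
def detection_metrics(events):
    """events: iterable of (is_robot, flagged_as_robot) booleans, one per download."""
    tp = fp = fn = tn = 0
    for is_robot, flagged in events:
        if flagged and is_robot:
            tp += 1  # robot correctly filtered out
        elif flagged and not is_robot:
            fp += 1  # human wrongly filtered out
        elif not flagged and is_robot:
            fn += 1  # robot that slips into the reported stats
        else:
            tn += 1  # human correctly kept
    recall = tp / (tp + fn)             # share of robots we catch
    precision = tp / (tp + fp)          # share of flagged downloads that are robots
    inverse_precision = tn / (tn + fn)  # share of *reported* downloads that are human
    return recall, precision, inverse_precision
```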
11. How does that compare?
• Who knows? There are no other studies like this on repositories!
• Applied DSpace's and EPrints' web robot detection algorithms to our data
– Experimental
– Real data
– Same dataset used for each 'system'
– Algorithms easy to mimic in vitro (sketched below)
– But SEO, crawl behaviour may be different for different systems
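The paper's "in vitro" comparison is not published as code; a minimal sketch of the idea, replaying one labelled dataset through interchangeable detector functions and scoring each with detection_metrics() from the previous sketch (the detectors passed in would be reimplementations, not the real DSpace/EPrints code):

```python
def compare_systems(events, detectors):
    """events: list of (is_robot, request) pairs from the labelled sample.
    detectors: {system_name: fn(request) -> bool} mimicking each system's algorithm."""
    results = {}
    for name, detect in detectors.items():
        labelled = [(is_robot, detect(request)) for is_robot, request in events]
        results[name] = detection_metrics(labelled)  # (recall, precision, inv. precision)
    return results
```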
12. Robot detection techniques used

Technique                             DSpace   EPrints   Minho DSpace Statistics Add-on
Rate of requests                                         ✓³
User agent string                     ✓        ✓         ✓
robots.txt access                              ✓
Volume of requests                             ✓²        ✓³
List of known robot IP addresses      ✓                  ✓
Reverse DNS name lookup               ✓¹
Trap file                                      ✓
User agents per IP address
Width of traversal in the URL space                      ✓³

¹ Only implemented nominally or experimentally
² Via the repeat download or 'double-click' filter
³ Data available as a configurable report for manual decision making
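User agent string matching is the one technique all three systems share. A minimal sketch of the approach; the pattern list below is illustrative only, not the actual list shipped by DSpace, EPrints or the Minho add-on (production lists run to hundreds of entries and need regular updates):

```python
import re

# Illustrative robot patterns; real lists (e.g. COUNTER's) are far longer.
ROBOT_PATTERNS = re.compile(
    r"bot|crawl|spider|slurp|archiver|curl|wget|python-requests",
    re.IGNORECASE,
)

def is_robot_user_agent(user_agent: str) -> bool:
    """True if the User-Agent string matches a known robot pattern.
    Catches only robots that name themselves; mimics slip through."""
    return bool(ROBOT_PATTERNS.search(user_agent or ""))

print(is_robot_user_agent("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(is_robot_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))        # False
```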
Download and other usage statistics in an item view
In addition, data is provided to Schools for quality reviews and accreditation
We have been aware of web robots since 2009, using the U. Minho add-on plus visually checking for outliers once a month.
Hit 1 million downloads in 2015 and decided we must know more (how to properly identify robots, how accurate our statistics are); we want to have confidence in the information that we produce.
Experiment: simple random sample of 2 years of download data (n=341, N=3.3 million, for 96.20% certainty), manually checked to determine if robot or human. DSpace 1.8.2 with U. Minho DSpace Statistics Add-on v. 4. Apache Tomcat behind Apache HTTP Server; logs in Apache Combined Log Format. The Minho add-on registers every download in the PostgreSQL database. Results to be published in the July 2016 issue of Library Hi Tech (Greene 2016).
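The notes mention logs in Apache Combined Log Format; a minimal sketch of pulling out the fields a robot check needs (client IP and User-Agent) from such a log line. The regex covers the standard format only, and the sample line is invented:

```python
import re

# Apache Combined Log Format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('66.249.66.1 - - [16/Jun/2016:10:00:00 +0100] '
        '"GET /handle/10197/1 HTTP/1.1" 200 1234 "-" '
        '"Googlebot/2.1 (+http://www.google.com/bot.html)"')

m = COMBINED_LOG.match(line)
if m:
    print(m.group("ip"), m.group("user_agent"))
```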
See:
INFORMATION POWER LTD. 2013. IRUS download data: identifying unusual usage [Online]. Available: http://www.irus.mimas.ac.uk/news/IRUS_download_data_Final_report.pdf [Accessed 2015-12-11]. (Confirms the 85% figure.)
DORAN, D. & GOKHALE, S. S. 2011. Web robot detection techniques: overview and limitations. Data Mining and Knowledge Discovery, 22, 183-210. (Hypothesizes why the robot share is so high in OA, p. 191.)
Sources for the robot-traffic ratios in the chart:
– Typical website (15% robot traffic): precision = 0.8727, mean of four studies; robots:total sessions = 0.1516, mean of four studies
– OA journal (40% robot): HUNTINGTON, P., NICHOLAS, D. & JAMALI, H. R. 2008. Web robot detection in the scholarly information environment. Journal of Information Science, 34, 726-741.
– OA repositories (85% robot): Greene 2016 and Information Power 2013 (see above)
– Internet Archive (91% robot): ALNOAMANY, Y., WEIGLE, M. C. & NELSON, M. L. 2013. Access patterns for robots and humans in web archives. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 339-348.
The reverse is also true: if we fail to catch robots (e.g. as detection deteriorates over time while robots improve their capabilities), the accuracy of the stats diminishes.
Formula (Greene 2016):
P_inv = [TR(R − PR − 1) + 2TPR − P(T + R − 1)] / [R(TR − P − T) + P]
where
R = recall (robot detection)
P = precision (robot detection)
P_inv = inverse precision (human stats)
T = ratio of robots to total
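A quick numeric check of the formula; a minimal sketch that plugs in UCD's measured values from slide 10 (recall 0.94, precision 0.989, robot ratio 0.85) and reproduces the reported 73% human figure:

```python
def inverse_precision(R, P, T):
    """Share of reported (post-filter) downloads made by humans (Greene 2016).
    R = recall, P = precision of robot detection, T = robot share of all traffic."""
    num = T * R * (R - P * R - 1) + 2 * T * P * R - P * (T + R - 1)
    den = R * (T * R - P - T) + P
    return num / den

# UCD's measured values reproduce the reported ~73% human statistics:
print(round(inverse_precision(R=0.94, P=0.989, T=0.85), 2))  # 0.73

# With no robot detection at all (R=0) the formula reduces to 1 - T,
# matching slide 8: 85% robots means only 15% of unfiltered stats are human.
print(round(inverse_precision(R=0.0, P=0.989, T=0.85), 2))   # 0.15
```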