Leabharlann UCD
An Coláiste Ollscoile, Baile
Átha Cliath,
Belfield, Baile Átha Cliath 4,
Eire
UCD Library
University College Dublin,
Belfield, Dublin 4, Ireland
Joseph Greene
Research Repository Librarian
University College Dublin
joseph.greene@ucd.ie
http://researchrepository.ucd.ie
How accurate are IR
usage statistics?
Open Repositories 2016
Dublin, 16 June
Usage statistics are important for OA
repositories
• How is the service used overall?
• Advocacy
– Connects with authors on what is most important
to them: the use of their research
• KPI for return on investment
– Usage of a Library service
– Visibility of university’s
research
Monthly email sent to all
depositors
Infographic distributed semi-annually
by College Liaison Librarians
How accurate are they? Web robots
• Some follow rules
– Search engines, Internet Archive, link checkers,
Twitterbot, etc.
– robots.txt, naming themselves in the user agent
string
• Others do not
– Email spammers, comment spammers, dictionary
attackers, phishers, etc.
– Often mimic human users
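A minimal sketch of the rule-following case: a robot that names itself can be caught by matching its user agent string against known patterns. The pattern list and the function name `is_self_declared_robot` are illustrative assumptions, not the lists used by DSpace, EPrints or the Minho add-on. Robots that mimic human browsers pass straight through a check like this.

```python
import re

# Illustrative patterns only; real deployments rely on longer, maintained lists.
ROBOT_PATTERNS = re.compile(r"bot|crawl|spider|slurp|archiver", re.IGNORECASE)


def is_self_declared_robot(user_agent: str) -> bool:
    """Return True if the user agent string matches a known robot pattern."""
    return bool(ROBOT_PATTERNS.search(user_agent or ""))


samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Twitterbot/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
]
for ua in samples:
    print(is_self_declared_robot(ua), ua)
```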
Experimental study
• Simple random sample of 2 years of UCD
repository’s download data
– n=341, N=3.3 million; 96.20% certainty
• Manually checked to determine if robot or human
• Compared findings against our robot detection
technique
– U. Minho DSpace Stats Add-on
– Monthly outlier exclusion (manual)
Greene, J. Web robot detection in scholarly Open Access institutional
repositories. Library Hi Tech, July 2016
First finding
85% of the Research
Repository UCD’s
unfiltered downloads
come from robots
• A 2013 IRUS-UK white paper covering 20 IRs confirms this:
it also found that 85% of downloads came from robots
Catching more robots improves stats
(But how much depends on the number of robots)
[Figure: accuracy of download stats (inverse precision) on the y-axis, annotated "Get better stats", against recall (robots) on the x-axis, annotated "Catch more robots"; both axes run from 0 to 1. Series:]
• Typical website, 15% robot traffic
• OA journal, 40% robot
• Internet Archive, 91% robot
• OA repositories, 85% robot
How did we do at UCD?
• What proportion of robot downloads did we
catch? (Recall)
– Our method catches 94% of all robots
• How often were we correct: how many of the downloads
we label as robots really are robots? (Precision)
– 98.9% of downloads that we label robots really
are robots
• How accurate are the download stats -- how
many are actually made by human beings?
(Inverse precision)
– 73% of the download statistics as reported are
human
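The three measures above can be sketched from a manually checked sample, where each download has a true class (robot or human) and a label from the detection method. The counts below are made up for illustration and are not the study's data; the function name `detection_metrics` is likewise an assumption.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute robot detection measures from a confusion matrix.

    tp: robots labelled robot      fp: humans labelled robot
    fn: robots labelled human      tn: humans labelled human
    """
    return {
        # Share of all robot downloads that the method catches.
        "recall": tp / (tp + fn),
        # Share of downloads labelled robot that really are robots.
        "precision": tp / (tp + fp),
        # Share of reported (labelled human) downloads that really are human.
        "inverse_precision": tn / (tn + fn),
    }


# Made-up counts for illustration only.
m = detection_metrics(tp=90, fp=1, fn=10, tn=19)
print(m)
```

The reported download statistics are whatever is left after filtering (`tn + fn`), which is why inverse precision, not precision, describes how trustworthy those statistics are.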
How does that compare?
• Who knows? There are no other studies like this
on repositories!
• Applied DSpace's and EPrints' web robot
detection algorithms to our data
– Experimental
– Real data
– Same dataset used for each ‘system’
– Algorithms easy to mimic in vitro
– But SEO, crawl behaviour may be different for
different systems
Robot detection techniques used
Systems: DSpace, EPrints, Minho DSpace Statistics Add-on
Rate of requests: ✓³
User agent string: ✓ ✓ ✓
robots.txt access: ✓
Volume of requests: ✓² ✓³
List of known robot IP addresses: ✓ ✓
Reverse DNS name lookup: ✓¹
Trap file: ✓
User agents per IP address: (none)
Width of traversal in the URL space: ✓³
¹ Only implemented nominally or experimentally
² Via the repeat download or ‘double-click’ filter
³ Data available as a configurable report for manual decision making
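The repeat download or ‘double-click’ filter mentioned in the footnotes can be sketched as follows. This is a minimal illustration, assuming each download is a `(timestamp_seconds, ip, item_id)` tuple and a hypothetical 30-second window; neither the record layout nor the window length is taken from EPrints or any other system compared here.

```python
def filter_double_clicks(downloads, window_seconds=30):
    """Drop repeat downloads of the same item by the same IP within the window.

    downloads: iterable of (timestamp_seconds, ip, item_id) tuples.
    Returns the kept downloads in timestamp order.
    """
    last_seen = {}  # (ip, item_id) -> timestamp of the most recent request
    kept = []
    for ts, ip, item in sorted(downloads):
        key = (ip, item)
        previous = last_seen.get(key)
        if previous is None or ts - previous > window_seconds:
            kept.append((ts, ip, item))
        # Update even when dropped, so a rapid run of requests keeps
        # extending the window and counts only once.
        last_seen[key] = ts
    return kept


log = [
    (0, "10.0.0.1", "item-1"),
    (5, "10.0.0.1", "item-1"),   # repeat inside the window: dropped
    (60, "10.0.0.1", "item-1"),  # outside the window: kept
    (7, "10.0.0.2", "item-1"),   # different IP address: kept
]
print(filter_double_clicks(log))
```

A filter like this only limits the volume of repeated requests for one item; a robot that walks across many items once each is untouched by it.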
Results
Robots detected (Recall)
• DSpace: 0.897
• EPrints: 0.911
• Minho (no manual outlier checking): 0.890
• Minho plus monthly manual checking (UCD): 0.942
Accuracy of detection (Precision)
• DSpace: 1.000
• EPrints: 0.940
• Minho (no manual outlier checking): 0.989
• Minho plus monthly manual checking (UCD): 0.989
Accuracy of download stats (Inverse precision)
• DSpace: 0.620
• EPrints: 0.552
• Minho (no manual outlier checking): 0.590
• Minho plus monthly manual checking (UCD): 0.730
• Without filtration: 0.144
I.e. 38% of DSpace’s reported downloads are made by robots, etc.
Robot detection in OA IR systems
[Figure: recall, precision and negative precision (accuracy of download stats) for DSpace, EPrints, Minho, Minho with monthly manual checking (UCD), and no robot detection]
Thank you!


Editor's notes

  1. Download and other usage statistics in an item view
  2. In addition, data is provided to Schools for quality reviews and accreditation
  3. Have been aware of web robots since 2009. Using the U. Minho add-on plus a visual check for outliers once a month. Hit 1 million downloads in 2015 and decided we must know more about it (how to properly identify robots, how accurate our statistics are); want to have confidence in the information that we produce.
  4. Experiment: simple random sample of 2 years of download data (n=341, N=3.3 million for 96.20% certainty), manually checked to determine if robot or human. DSpace 1.8.2 with U. Minho DSpace Statistics Add-on v. 4. Apache Tomcat behind Apache HTTP server; logs in Apache Combined Log Format. Minho registers every download in the PostgreSQL database. Results to be published in July 2016 issue of Library Hi Tech (Greene 2016)
  5. See: INFORMATION POWER LTD. 2013. IRUS download data: identifying unusual usage [Online]. Available: http://www.irus.mimas.ac.uk/news/IRUS_download_data_Final_report.pdf [Accessed 2015-12-11]. Confirms 85% figure.
     DORAN, D. & GOKHALE, S. S. 2011. Web robot detection techniques: overview and limitations. Data Mining and Knowledge Discovery, 22, 183-210. Hypothesizes why so high in OA (p. 191).
  6. Typical website (15% robot traffic): precision = 0.8727, mean of four studies; robots:total sessions = 0.1516, mean of four studies.
     OA journal (40% robot): HUNTINGTON, P., NICHOLAS, D. & JAMALI, H. R. 2008. Web robot detection in the scholarly information environment. Journal of Information Science, 34, 726-741.
     OA repositories (85% robot): Greene 2016 and Information Power 2013 (see above).
     Internet Archive (91% robot): ALNOAMANY, Y., WEIGLE, M. C. & NELSON, M. L. 2013. Access patterns for robots and humans in web archives. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 339-348.
     Reverse is also true: fail to catch robots (e.g. deterioration over time as robots improve their capabilities), accuracy of stats diminishes.
     Formula (Greene 2016):
     $$P_{\mathrm{inv}} = \frac{TR(R - PR - 1) + 2TPR - P(T + R - 1)}{R(TR - P - T) + P}$$
     R = recall (robot detection); P = precision (robot detection); P_inv = inverse precision (human stats); T = ratio of robots to total.
  7. Greene 2016
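
The formula in note 6 can be evaluated directly. The sketch below transcribes it as written and applies it to the UCD recall and precision reported in the slides. The robot ratio T = 0.856 is an assumption, taken as one minus the 0.144 reported for unfiltered statistics; the function name `inverse_precision` is also illustrative.

```python
def inverse_precision(recall: float, precision: float, robot_ratio: float) -> float:
    """Inverse precision from recall (R), precision (P) and robot ratio (T),
    using the formula quoted in note 6."""
    R, P, T = recall, precision, robot_ratio
    numerator = T * R * (R - P * R - 1) + 2 * T * P * R - P * (T + R - 1)
    denominator = R * (T * R - P - T) + P
    if abs(denominator) < 1e-12:
        # The expression as written becomes 0/0 at R = 1, among other cases.
        raise ValueError("formula is undefined for these inputs")
    return numerator / denominator


# Assumption: T = 1 - 0.144, the share of robots in the unfiltered statistics.
ucd = inverse_precision(recall=0.942, precision=0.989, robot_ratio=0.856)
print(round(ucd, 3))
```

Under that assumption the result lands within rounding distance of the 0.730 reported for UCD, and the same holds for the DSpace, EPrints and Minho figures when their recall and precision are substituted.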