Leabharlann UCD
An Coláiste Ollscoile, Baile
Átha Cliath,
Belfield, Baile Átha Cliath 4,
Eire
UCD Library
University College Dublin,
Belfield, Dublin 4, Ireland
Joseph Greene
Research Repository Librarian
University College Dublin
joseph.greene@ucd.ie
http://researchrepository.ucd.ie
#iCanHazRobot?
Improved robot detection for IR usage statistics
Open Repositories 2016
Dublin, 14 June
Overview and take-home points
• Usage stats are important
– (go to the Usage Stats panel on Thursday,
16/Jun/2016: 11:00am - 12:30pm)
• Robot filtration is a problem, especially in
repositories
• Robot detection has an exponential effect on
usage stats’ accuracy in repositories
• 2-3 ways to improve DSpace and EPrints’ usage
stats by 20% or more will be demonstrated
Experimental study
• Simple random sample of 2 years of UCD
repository’s download data
– n=341, N=3.3 million; 96.20% certainty
• Manually checked to determine if robot or human
• Applied DSpace, EPrints robot detection
algorithms to the dataset
– This is an EXPERIMENT, simulating algorithms on a
DSpace repository’s usage data and Apache logs
– The data is real, live data, and the algorithms were
very easy to simulate
First finding
85% of unfiltered
repository downloads
come from robots
• A 2013 IRUS-UK white paper covering 20 IRs confirms
this: it also found that 85% of downloads came from robots
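A rough sense of how much a sample of this size can be trusted comes from the normal-approximation margin of error for a proportion. The sketch below applies that textbook formula to the figures on these slides (85% robots, n=341, N=3.3 million). It is an illustration only and not necessarily how the 96.20% figure above was calculated; the function name and the 95% confidence level are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(p, n, population=None, confidence=0.95):
    """Normal-approximation margin of error for a sample proportion p."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction; negligible when n is tiny next to N.
        standard_error *= sqrt((population - n) / (population - 1))
    return z * standard_error


moe = margin_of_error(p=0.85, n=341, population=3_300_000)
print(f"Robot share: 85% give or take {moe:.1%} at 95% confidence")
```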
Catching more robots improves stats
(But how much depends on the number of robots)
[Line chart. x-axis: Recall (robots), 0 to 1 ("Catch more robots"). y-axis: Accuracy of download stats (inverse precision), 0 to 1 ("Get better stats"). Series: Typical website, 15% robot traffic; OA journal, 40% robot; OA repositories, 85% robot; Internet Archive, 91% robot.]
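The shape of these curves can be reproduced from the definitions alone. The sketch below derives inverse precision from recall (R), precision (P) and the robot share of traffic (T) by rebuilding the confusion matrix for one unit of traffic. It also transcribes the formula given in the notes for this slide (Greene 2016) so the two can be compared. Function names are illustrative, and holding precision fixed at 1.0 in the loop is an assumption made only to keep the printout short.

```python
def inverse_precision(recall, precision, robot_share):
    """Share of counted downloads that are human, from first principles."""
    true_pos = recall * robot_share                    # robots caught
    false_neg = robot_share - true_pos                 # robots left in the stats
    false_pos = true_pos * (1 - precision) / precision  # humans wrongly excluded
    true_neg = (1 - robot_share) - false_pos           # humans left in the stats
    return true_neg / (true_neg + false_neg)


def inverse_precision_notes_formula(R, P, T):
    """Formula as written in the notes; undefined (0/0) at R = 1."""
    numerator = T * R * (R - P * R - 1) + 2 * T * P * R - P * (T + R - 1)
    denominator = R * (T * R - P - T) + P
    return numerator / denominator


for share in (0.15, 0.40, 0.85, 0.91):
    curve = [round(inverse_precision(r, 1.0, share), 2) for r in (0.0, 0.5, 0.9)]
    print(f"robot share {share:.2f}: accuracy at recall 0, 0.5, 0.9 = {curve}")
```

With a low robot share the curve starts high and stays nearly flat. With an 85% share it starts near 0.15 and climbs steeply only as recall approaches 1, which is why small gains in recall matter so much for repositories.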
Robot detection techniques used
Systems compared: DSpace, EPrints, Minho DSpace Statistics Add-on
(each ✓ marks one system that uses the technique)
• Rate of requests: ✓ [3]
• User agent string: ✓ ✓ ✓
• robots.txt access: ✓
• Volume of requests: ✓ [2], ✓ [3]
• List of known robot IP addresses: ✓ ✓
• Reverse DNS name lookup: ✓ [1]
• Trap file: ✓
• User agents per IP address: (none)
• Width of traversal in the URL space: ✓ [3]
[1] Only implemented nominally or experimentally
[2] Via the repeat download or ‘double-click’ filter
[3] Data available as a configurable report for manual decision making
Measurements used in robot detection
• All measurements are a number between 0 and 1
• Recall: proportion of robots detected
– I can haz robot?
• Precision: true positives in robot detection
– Proportion of discounted downloads that are
actually made by robots (sometimes humans are
counted as robots)
• Accuracy of download stats, measured as inverse
precision:
– Proportion of counted downloads that are actually
made by humans
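These measurements all come from the same four counts: robots excluded (true positives), humans excluded (false positives), robots counted (false negatives) and humans counted (true negatives). A minimal sketch follows; it also includes inverse recall and overall accuracy, which appear on a later slide. The function name and the counts in the example are made up for illustration and are not results from the study.

```python
def detection_metrics(tp, fp, fn, tn):
    """Robot-detection measurements from confusion-matrix counts.

    tp: robots excluded, fp: humans excluded,
    fn: robots still counted, tn: humans counted.
    """
    n = tp + fp + fn + tn
    return {
        "recall": tp / (tp + fn),             # proportion of robots detected
        "precision": tp / (tp + fp),          # excluded downloads that are robots
        "inverse_recall": tn / (tn + fp),     # human downloads kept in the stats
        "inverse_precision": tn / (tn + fn),  # reported stats that are human
        "accuracy": (tp + tn) / n,            # overall accuracy, (Tp + Tn) / n
    }


# Hypothetical counts for a 341-download sample (not the study's results).
example = detection_metrics(tp=200, fp=1, fn=90, tn=50)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```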
How they perform, out-of-the-box
[Bar chart: Robot detection in OA IR systems, scale 0 to 1. Groups: DSpace; EPrints; Minho; Minho with monthly manual checking; No robot detection. Series: Recall; Precision; Negative precision (accuracy of download stats).]
Room for improvement?
1. Ability to manually check for outliers
• At UCD, once a month, we check:
– Daily downloads for the last 2-4 months
– Top 10 most downloaded items
– Top 20 downloading IP addresses for the last 2-4
months
[Bar charts, scale 0 to 1. Panel 1: Robots caught (Recall) for DSpace, EPrints, Minho. Panel 2: Accuracy of reported download stats (Inverse precision) for DSpace, EPrints, Minho, Without robot detection. Series: Out-of-the-box; With manual checking (outlier exclusion).]
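The three monthly checks listed above can be produced from a plain list of download records. The sketch below assumes each record is a (timestamp, ip, item_id) tuple; that layout, the function name and the sample rows are hypothetical and not the Minho add-on's schema. Deciding what counts as an outlier stays a manual judgment.

```python
from collections import Counter
from datetime import datetime


def monthly_outlier_report(downloads, top_items=10, top_ips=20):
    """Summaries for manual outlier checking.

    downloads: iterable of (timestamp, ip, item_id) tuples.
    """
    rows = list(downloads)
    per_day = Counter(ts.date() for ts, _ip, _item in rows)
    per_item = Counter(item for _ts, _ip, item in rows)
    per_ip = Counter(ip for _ts, ip, _item in rows)
    return {
        "daily": sorted(per_day.items()),            # daily download totals
        "top_items": per_item.most_common(top_items),  # most downloaded items
        "top_ips": per_ip.most_common(top_ips),        # busiest IP addresses
    }


# Made-up records for illustration.
sample_rows = [
    (datetime(2016, 6, 1, 9, 0), "10.0.0.1", "item-a"),
    (datetime(2016, 6, 1, 10, 0), "10.0.0.1", "item-a"),
    (datetime(2016, 6, 2, 9, 0), "10.0.0.2", "item-b"),
]
report = monthly_outlier_report(sample_rows)
print(report["top_ips"])
```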
2. Recalibrate the EPrints repeat-download (double-click) filter
[Bar chart: Effect of double-click filter on EPrints’ robot detection and stats, scale 0 to 1. Groups: Recall (robots); Precision (accuracy of excluded downloads); Inverse recall (legitimate downloads accounted for in stats); Inverse precision (accuracy of reported download stats); Overall accuracy, (Tp + Tn) / n. Series: Without double-click filter; With double-click filter (out-of-the-box); With recalibrated double-click filter*.]
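A repeat-download filter with an adjustable threshold could look like the sketch below. It is not EPrints code. The notes describe the default as one IP address downloading one item more than once in 24 hours, and the recalibrated version as more than 10 times. The exact exclusion rule used here (keep the first `max_per_window` downloads per IP and item in any trailing 24 hours, drop the rest) is an assumption, as are all names.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def filter_repeat_downloads(downloads, max_per_window=10,
                            window=timedelta(hours=24)):
    """Return the downloads that survive the repeat-download filter.

    downloads: iterable of (timestamp, ip, item_id) tuples.
    """
    counted = defaultdict(deque)  # (ip, item_id) -> timestamps already counted
    kept = []
    for ts, ip, item in sorted(downloads, key=lambda d: d[0]):
        recent = counted[(ip, item)]
        # Forget counted downloads that have aged out of the window.
        while recent and ts - recent[0] >= window:
            recent.popleft()
        if len(recent) < max_per_window:
            recent.append(ts)
            kept.append((ts, ip, item))
    return kept


# Made-up burst: one IP fetches the same item 12 times in 12 minutes.
start = datetime(2016, 6, 14, 9, 0)
burst = [(start + timedelta(minutes=i), "10.0.0.1", "item-a") for i in range(12)]
print(len(filter_repeat_downloads(burst, max_per_window=1)))   # default-style
print(len(filter_repeat_downloads(burst, max_per_window=10)))  # recalibrated
```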
3. Port Minho’s robot detection code (a
log parser) onto DSpace or EPrints
• 1 Java class
• Input is Apache Combined Log Format
• Output is a database update (robot = true field)
– Similar to EPrints' $is_robot variable in Robots.pm
– Could be modified to update the DSpace 'isBot'
field in the Solr usage events document
• Requires 2 database tables to store learned
agents and IPs
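The general shape of such a log parser can be sketched as follows. This is not a port of the Minho Java class. It parses Apache Combined Log Format and applies one assumed learning rule, taken from the techniques table: any client that requests /robots.txt has its IP address and user agent remembered, and later requests from a remembered IP or agent are flagged. The two sets stand in for the two database tables; the sample log lines are invented.

```python
import re

# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def flag_robots(log_lines):
    """Return (ip, path, is_robot) for each parseable log line, in order."""
    learned_ips, learned_agents = set(), set()  # stand-ins for the two tables
    results = []
    for line in log_lines:
        match = COMBINED.match(line.strip())
        if match is None:
            continue  # skip lines that are not in Combined Log Format
        ip, path, agent = match["ip"], match["path"], match["agent"]
        if path == "/robots.txt":
            # Assumed rule: clients that read robots.txt are robots.
            learned_ips.add(ip)
            learned_agents.add(agent)
        is_robot = ip in learned_ips or agent in learned_agents
        results.append((ip, path, is_robot))
    return results


SAMPLE = [
    '203.0.113.5 - - [14/Jun/2016:10:00:00 +0100] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [14/Jun/2016:10:00:05 +0100] "GET /bitstream/1/paper.pdf HTTP/1.1" 200 5000 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [14/Jun/2016:10:01:00 +0100] "GET /bitstream/1/paper.pdf HTTP/1.1" 200 5000 "http://example.org/" "Mozilla/5.0"',
    '192.0.2.9 - - [14/Jun/2016:10:02:00 +0100] "GET /bitstream/2/thesis.pdf HTTP/1.1" 200 7000 "-" "ExampleBot/1.0"',
]
for ip, path, is_robot in flag_robots(SAMPLE):
    print(ip, path, "robot" if is_robot else "human")
```

In a real deployment the boolean would be written back to the statistics store, for example the robot field mentioned above, rather than printed.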
[Bar charts, scale 0 to 1. Panel 1: Robots caught (Recall) for DSpace, EPrints, Minho. Panel 2: Accuracy of reported download stats (Inverse precision) for DSpace, EPrints, Minho, Without robot detection. Series: Out-of-the-box; With Minho log parser.]
4. Combine two or more techniques
[Bar chart: Robots caught (Recall), scale 0 to 1, for DSpace, EPrints, Minho. Series: Out-of-the-box; With manual checking (outlier exclusion); With recalibrated double-click filter*; With Minho log parser; With Minho and outliers; Minho, outliers, and recalibrated double-click*.]
4. Combine two or more techniques (continued)
[Bar chart: Accuracy of reported download stats (Inverse precision), scale 0 to 1, for DSpace, EPrints, Minho, Without robot detection. Series: Out-of-the-box; With manual checking (outlier exclusion); With recalibrated double-click filter*; With Minho log parser; With Minho and outliers; Minho, outliers, and recalibrated double-click*.]
Thank you!


Editor's notes

  1. Good news: DSpace and EPrints filter robots out of the box. Bad news: the stats are still quite inaccurate. More good news: improving robot recall has an exponential effect on usage stats accuracy. Usage stats are primarily download counts. They are used heavily in marketing the repository, and they provide a measure of ROI both to those who upload items (an investment of time and effort) and to those who fund the repository. More downloads = more UCD visibility, which is one measure of our ROI.
  2. Experiment: simple random sample of 2 years of download data (n=341, N=3.3 million for 96.20% certainty), manually checked to determine if robot or human. System: DSpace 1.8.2 with U. Minho DSpace Statistics Add-on v. 4; Apache Tomcat behind Apache HTTP Server; logs in Apache Combined Log Format. Minho registers every download in the PostgreSQL database. Results to be published in the July 2016 issue of Library Hi Tech (Greene 2016). This dataset is used to experimentally test different detection techniques, alone and in combination. Weaknesses: the data is taken from a DSpace/Minho system (its own SEO, its own way of being crawled, etc.). 'In vitro': except for the original system (DSpace/Minho + monthly manual outlier checking), the robot detection techniques are simulated, hence EXPERIMENTAL. Strengths: 'in vivo': the data is real data from a production OA IR system. Simulating the various detection techniques was very easy to do, so the results are probably a very accurate picture of how each system would have treated this dataset.
  3. See: INFORMATION POWER LTD. 2013. IRUS download data: identifying unusual usage [Online]. Available: http://www.irus.mimas.ac.uk/news/IRUS_download_data_Final_report.pdf [Accessed 2015-12-11]. Confirms the 85% figure. DORAN, D. & GOKHALE, S. S. 2011. Web robot detection techniques: overview and limitations. Data Mining and Knowledge Discovery, 22, 183-210. Hypothesizes why the figure is so high in OA (p. 191).
  4. Typical website (15% robot traffic): precision = 0.8727, mean of four studies; robots:total sessions = 0.1516, mean of four studies. OA journal (40% robot): HUNTINGTON, P., NICHOLAS, D. & JAMALI, H. R. 2008. Web robot detection in the scholarly information environment. Journal of Information Science, 34, 726-741. OA repositories (85% robot): Greene 2016 and Information Power 2013 (see above). Internet Archive (91% robot): ALNOAMANY, Y., WEIGLE, M. C. & NELSON, M. L. 2013. Access patterns for robots and humans in web archives. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 339-348. The reverse is also true: if you fail to catch robots (e.g. deterioration over time as robots improve their capabilities), the accuracy of stats diminishes. Formula (Greene 2016): $$P_{inv} = \frac{TR(R - PR - 1) + 2TPR - P(T + R - 1)}{R(TR - P - T) + P}$$ where R = recall (robot detection), P = precision (robot detection), P_inv = inverse precision (human stats), T = ratio of robots to total.
  5. Greene 2016
  6. Minho with monthly manual checking is the original data as measured in vivo. Minho alone has detected manual outliers removed. DSpace and EPrints have been generated by applying their native algorithms to the data.
  7. Outliers: cf. LAMOTHE, A. R. 2014. The importance of identifying and accommodating e-resource usage data for the presence of outliers. Information Technology and Libraries, 33, 31-44.
  8. *Recalibrated double-click filter: a single IP address downloading a single item more than 10 times in 24 hours is excluded. By default, the filter excludes 1 IP address downloading 1 item more than once in 24 hours. The timeout length can be configured, but the number of downloads allowed within the period currently cannot. See also: JOINT, N., FIELD, A. & GREGSON, M. 2011. Please change the way IRstats works [Online]. Available: http://www.eprints.org/tech.php/15695.html [Accessed November 28 2015]. The drop in inverse recall (loss of legitimate downloads) supports the concern raised in this email discussion. However, if the recalibration were implemented, the improvement in robot precision means that the increase in legitimate downloads is offset by the decrease in illegitimate ones, so inverse precision is not greatly affected. Overall accuracy, however, improves notably.