SlideShare une entreprise Scribd logo
1  sur  18
Bot or Not?
Detecting bots in GitHub
pull request activity based
on comment similarity
-- Mehdi Golzadeh
-- Damien Legay
-- Alexandre Decan
-- Tom Mens
---------------------------------------
{firstname.lastname}@umons.ac.be
Software engineering lab - Univerisity of Mons
2nd International Workshop on Bots in Software Engineering
May 24th, 2020, Seoul, South Korea In conjunction with ICSE 2020
> Introduction
> Online distributed social coding platforms such as
GitHub
-- Commit changes
-- Submit pull requests
-- Code reviews
> Projects are increasingly relying on bots
-- Carry out routine tasks
-- Interact with project members
> Bots have been shown to improve collaborative
development
-- Improving software quality by automated refactoring {Wyrich2019}
-- Generating patches for bugs {Monperrus2019}
-- Continuous integration {Ablett2007}
-- Automatically closing abandoned issues and pull requests {Wessel2019}
1/13
> What is a pull request?
-- A method of code contribution
-- Participate in a project without having direct commit access
> Pull-based development
2/13
> Pull-based development
2/13
> Pull-based development
2/13
> Pull-based development
2/13
> Pull-based development
2/13
> Problems
> Empirical studies about human behaviour
-- Distinguish human behaviour from automated behaviour {Liu2019}
-- Identity merging {GoeminneComparison2012}
-- Contributor onboarding {Casalnuovo2015}
-- Contributor abandonment {Constantinou2017ISSE}
-- Team productivity {Meyer2014,Vasilescu2015}
-- Team collaboration {Tamburri2019}
Proportion of accepted pull requests with and without presence of bots in commenting activity
3/13
> Dataset
> Current methods to identify bots on GitHub
-- A manual and effort-prone process
-- Not easily reproducible
-- Presence of the string "bot" in the user identity
> Create ground truth
-- Accounts that are currently active in cargo ecosystem
-- Two distinct authors of this paper manually and independently verified all
identities
-----------------------------------------------
| | Humans | | Total |
-----------------------------------------------
| GitHub identities | 250 | 42 | 292 |
-----------------------------------------------
| PR comments | 16,430 | 3,360 | 20,090 |
-----------------------------------------------
| GitHub repositories | 692 | 694 | 1,262 |
-----------------------------------------------
4/13
> Underlying Idea
> An automated approach for detecting bots in GitHub
-- Based on the similarity of comments
5/13
> Approach
> Measure the similarity between comments
> Levenshtein edit distance
-- Works more on structure
-- Counting the number of single-character edits
> Jaccard distance
-- Similarity of content
-- Compares words in two strings
-- We computed the Levenshtein and Jaccard distance of all pairs of PR messages
C1 C2 C3
-------------------------
C1 | 0 | | |
-------------------------
C2 | 0.42 | 0 | |
-------------------------
C3 | 0.63 | 0.2 | 0 |
-------------------------
6/13
> Results
-- humans have higher mean distances than bots
-- very similar results but not redundant
-- some bots can be detected by one and not with another
7/13
> Idea behind clustering
> Pairwise similarity does not work for some bots
-- different message patterns
8/13
c c
c
c c
c
c
c
> PR comment clustering
> Clusters of comments
-- Messages follow several distinct patterns
-- Human comments spread across many small clusters
-- Bots have a small number of large clusters
> DBSCAN
-- Specifying the number of expected clusters
-- Generate clusters of unequal size
-- Allows clusters to contain a single item
-- Combined Levenshtein-Jaccard distance
9/13
> Approach and results
> Clusters of comments
-- Bots have a lower number of clusters regardless of the number of messages
-- All 42 bots have 10 clusters or less
-- 249 out of the 250 humans have at least 14 clusters
-- Classifying bots based on a threshold of 10 clusters
-- Recall of 100% and accuracy of 97.7%
10/13
> Limitations and Future work
> Limitations
-- Only 292 GitHub identities from Cargo
-- Clusters of comments is the only decision factor
> Future work
-- Already manually verified 5,000 accounts from 37 ecosystems
-- Creating a classification model instead of a threshold for clusters
-- Number of comments, empty comments and Gini value as distinguishing metrics
-- Developing a tool
11/13
> Outcome
> Pull-based model and bots
-- Presence of bots in social coding platforms
-- Problem raised by presence of bots
> Our approach and the results
-- Based on comment similarity and clusters
-- Clusters of combination of the Jaccard and Levenshtein distance
-- Our approach classifies accounts with high precision and recall
> The future work
-- Create a bigger dataset
-- Consider more features to make a better model to classify
-- Create a tool to make it useful for the community
12/13
> Ctrl + Z
> Thanks
-- Any question?
13/13joint FNRS / FWO Excellence of Science project SECO-ASSIST
SECO-ASSIST
https://secoassist.github.io

Contenu connexe

Similaire à Bot or not? Detecting bots in GitHub pull request activity based on comment similarity

MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB
 
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Khai Tran
 
Building Your First App with Shawn Mcarthy
Building Your First App with Shawn Mcarthy Building Your First App with Shawn Mcarthy
Building Your First App with Shawn Mcarthy MongoDB
 
Ship code like a keptn
Ship code like a keptnShip code like a keptn
Ship code like a keptnRob Jahn
 
CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton Araf Karsh Hamid
 
MongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local DC 2018: MongoDB Ops Manager + KubernetesMongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local DC 2018: MongoDB Ops Manager + KubernetesMongoDB
 
Volkswagen | ECU Software Development with codeBeamer ALM: IT Aspects
Volkswagen | ECU Software Development with codeBeamer ALM: IT AspectsVolkswagen | ECU Software Development with codeBeamer ALM: IT Aspects
Volkswagen | ECU Software Development with codeBeamer ALM: IT AspectsIntland Software GmbH
 
Business Process Management with Office 365
Business Process Management with Office 365Business Process Management with Office 365
Business Process Management with Office 365Paul J. Swider
 
Gerrit Analytics applied to Android source code
Gerrit Analytics applied to Android source codeGerrit Analytics applied to Android source code
Gerrit Analytics applied to Android source codeLuca Milanesio
 
Hyperledger Composer architecture
Hyperledger Composer architectureHyperledger Composer architecture
Hyperledger Composer architectureSimon Stone
 
Hyperledger Composer Update 2017-04-05
Hyperledger Composer Update 2017-04-05Hyperledger Composer Update 2017-04-05
Hyperledger Composer Update 2017-04-05Dan Selman
 
Big query - Command line tools and Tips - (MOSG)
Big query - Command line tools and Tips - (MOSG)Big query - Command line tools and Tips - (MOSG)
Big query - Command line tools and Tips - (MOSG)Soshi Nemoto
 
Vue micro frontend implementation patterns
Vue micro frontend implementation patternsVue micro frontend implementation patterns
Vue micro frontend implementation patternsAlbert Brand
 
Introduction to GitHub Actions – How to easily automate and integrate with Gi...
Introduction to GitHub Actions – How to easily automate and integrate with Gi...Introduction to GitHub Actions – How to easily automate and integrate with Gi...
Introduction to GitHub Actions – How to easily automate and integrate with Gi...All Things Open
 
Open shift deployment review getting ready for day 2 operations
Open shift deployment review   getting ready for day 2 operationsOpen shift deployment review   getting ready for day 2 operations
Open shift deployment review getting ready for day 2 operationsHendrik van Run
 

Similaire à Bot or not? Detecting bots in GitHub pull request activity based on comment similarity (20)

React Basic Examples
React Basic ExamplesReact Basic Examples
React Basic Examples
 
MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + KubernetesMongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local Austin 2018: MongoDB Ops Manager + Kubernetes
 
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
 
Building Your First App with Shawn Mcarthy
Building Your First App with Shawn Mcarthy Building Your First App with Shawn Mcarthy
Building Your First App with Shawn Mcarthy
 
Ship code like a keptn
Ship code like a keptnShip code like a keptn
Ship code like a keptn
 
CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton
 
CodeIgniter Framework
CodeIgniter FrameworkCodeIgniter Framework
CodeIgniter Framework
 
MongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local DC 2018: MongoDB Ops Manager + KubernetesMongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
MongoDB.local DC 2018: MongoDB Ops Manager + Kubernetes
 
Volkswagen | ECU Software Development with codeBeamer ALM: IT Aspects
Volkswagen | ECU Software Development with codeBeamer ALM: IT AspectsVolkswagen | ECU Software Development with codeBeamer ALM: IT Aspects
Volkswagen | ECU Software Development with codeBeamer ALM: IT Aspects
 
Development Workflows on AWS
Development Workflows on AWSDevelopment Workflows on AWS
Development Workflows on AWS
 
Business Process Management with Office 365
Business Process Management with Office 365Business Process Management with Office 365
Business Process Management with Office 365
 
Gerrit Analytics applied to Android source code
Gerrit Analytics applied to Android source codeGerrit Analytics applied to Android source code
Gerrit Analytics applied to Android source code
 
Hyperledger Composer architecture
Hyperledger Composer architectureHyperledger Composer architecture
Hyperledger Composer architecture
 
Hyperledger Composer Update 2017-04-05
Hyperledger Composer Update 2017-04-05Hyperledger Composer Update 2017-04-05
Hyperledger Composer Update 2017-04-05
 
Big query - Command line tools and Tips - (MOSG)
Big query - Command line tools and Tips - (MOSG)Big query - Command line tools and Tips - (MOSG)
Big query - Command line tools and Tips - (MOSG)
 
Basic Rails Training
Basic Rails TrainingBasic Rails Training
Basic Rails Training
 
Intro to GitHub Actions
Intro to GitHub ActionsIntro to GitHub Actions
Intro to GitHub Actions
 
Vue micro frontend implementation patterns
Vue micro frontend implementation patternsVue micro frontend implementation patterns
Vue micro frontend implementation patterns
 
Introduction to GitHub Actions – How to easily automate and integrate with Gi...
Introduction to GitHub Actions – How to easily automate and integrate with Gi...Introduction to GitHub Actions – How to easily automate and integrate with Gi...
Introduction to GitHub Actions – How to easily automate and integrate with Gi...
 
Open shift deployment review getting ready for day 2 operations
Open shift deployment review   getting ready for day 2 operationsOpen shift deployment review   getting ready for day 2 operations
Open shift deployment review getting ready for day 2 operations
 

Plus de Tom Mens

How to be(come) a successful PhD student
How to be(come) a successful PhD studentHow to be(come) a successful PhD student
How to be(come) a successful PhD studentTom Mens
 
Recognising bot activity in collaborative software development
Recognising bot activity in collaborative software developmentRecognising bot activity in collaborative software development
Recognising bot activity in collaborative software developmentTom Mens
 
A Dataset of Bot and Human Activities in GitHub
A Dataset of Bot and Human Activities in GitHubA Dataset of Bot and Human Activities in GitHub
A Dataset of Bot and Human Activities in GitHubTom Mens
 
The (r)evolution of CI/CD on GitHub
 The (r)evolution of CI/CD on GitHub The (r)evolution of CI/CD on GitHub
The (r)evolution of CI/CD on GitHubTom Mens
 
Nurturing the Software Ecosystems of the Future
Nurturing the Software Ecosystems of the FutureNurturing the Software Ecosystems of the Future
Nurturing the Software Ecosystems of the FutureTom Mens
 
Comment programmer un robot en 30 minutes?
Comment programmer un robot en 30 minutes?Comment programmer un robot en 30 minutes?
Comment programmer un robot en 30 minutes?Tom Mens
 
On the rise and fall of CI services in GitHub
On the rise and fall of CI services in GitHubOn the rise and fall of CI services in GitHub
On the rise and fall of CI services in GitHubTom Mens
 
On backporting practices in package dependency networks
On backporting practices in package dependency networksOn backporting practices in package dependency networks
On backporting practices in package dependency networksTom Mens
 
Comparing semantic versioning practices in Cargo, npm, Packagist and Rubygems
Comparing semantic versioning practices in Cargo, npm, Packagist and RubygemsComparing semantic versioning practices in Cargo, npm, Packagist and Rubygems
Comparing semantic versioning practices in Cargo, npm, Packagist and RubygemsTom Mens
 
Lost in Zero Space
Lost in Zero SpaceLost in Zero Space
Lost in Zero SpaceTom Mens
 
Evaluating a bot detection model on git commit messages
Evaluating a bot detection model on git commit messagesEvaluating a bot detection model on git commit messages
Evaluating a bot detection model on git commit messagesTom Mens
 
Is my software ecosystem healthy? It depends!
Is my software ecosystem healthy? It depends!Is my software ecosystem healthy? It depends!
Is my software ecosystem healthy? It depends!Tom Mens
 
On the fragility of open source software packaging ecosystems
On the fragility of open source software packaging ecosystemsOn the fragility of open source software packaging ecosystems
On the fragility of open source software packaging ecosystemsTom Mens
 
How magic is zero? An Empirical Analysis of Initial Development Releases in S...
How magic is zero? An Empirical Analysis of Initial Development Releases in S...How magic is zero? An Empirical Analysis of Initial Development Releases in S...
How magic is zero? An Empirical Analysis of Initial Development Releases in S...Tom Mens
 
Comparing dependency issues across software package distributions (FOSDEM 2020)
Comparing dependency issues across software package distributions (FOSDEM 2020)Comparing dependency issues across software package distributions (FOSDEM 2020)
Comparing dependency issues across software package distributions (FOSDEM 2020)Tom Mens
 
Measuring Technical Lag in Software Deployments (CHAOSScon 2020)
Measuring Technical Lag in Software Deployments (CHAOSScon 2020)Measuring Technical Lag in Software Deployments (CHAOSScon 2020)
Measuring Technical Lag in Software Deployments (CHAOSScon 2020)Tom Mens
 
SecoHealth 2019 Research Achievements
SecoHealth 2019 Research AchievementsSecoHealth 2019 Research Achievements
SecoHealth 2019 Research AchievementsTom Mens
 
SECO-Assist 2019 research seminar
SECO-Assist 2019 research seminarSECO-Assist 2019 research seminar
SECO-Assist 2019 research seminarTom Mens
 
Empirically Analysing the Socio-Technical Health of Software Package Managers
Empirically Analysing the Socio-Technical Health of Software Package ManagersEmpirically Analysing the Socio-Technical Health of Software Package Managers
Empirically Analysing the Socio-Technical Health of Software Package ManagersTom Mens
 
ConPan: Analysing Packages Installed in Docker Containers
ConPan: Analysing Packages Installed in Docker ContainersConPan: Analysing Packages Installed in Docker Containers
ConPan: Analysing Packages Installed in Docker ContainersTom Mens
 

Plus de Tom Mens (20)

How to be(come) a successful PhD student
How to be(come) a successful PhD studentHow to be(come) a successful PhD student
How to be(come) a successful PhD student
 
Recognising bot activity in collaborative software development
Recognising bot activity in collaborative software developmentRecognising bot activity in collaborative software development
Recognising bot activity in collaborative software development
 
A Dataset of Bot and Human Activities in GitHub
A Dataset of Bot and Human Activities in GitHubA Dataset of Bot and Human Activities in GitHub
A Dataset of Bot and Human Activities in GitHub
 
The (r)evolution of CI/CD on GitHub
 The (r)evolution of CI/CD on GitHub The (r)evolution of CI/CD on GitHub
The (r)evolution of CI/CD on GitHub
 
Nurturing the Software Ecosystems of the Future
Nurturing the Software Ecosystems of the FutureNurturing the Software Ecosystems of the Future
Nurturing the Software Ecosystems of the Future
 
Comment programmer un robot en 30 minutes?
Comment programmer un robot en 30 minutes?Comment programmer un robot en 30 minutes?
Comment programmer un robot en 30 minutes?
 
On the rise and fall of CI services in GitHub
On the rise and fall of CI services in GitHubOn the rise and fall of CI services in GitHub
On the rise and fall of CI services in GitHub
 
On backporting practices in package dependency networks
On backporting practices in package dependency networksOn backporting practices in package dependency networks
On backporting practices in package dependency networks
 
Comparing semantic versioning practices in Cargo, npm, Packagist and Rubygems
Comparing semantic versioning practices in Cargo, npm, Packagist and RubygemsComparing semantic versioning practices in Cargo, npm, Packagist and Rubygems
Comparing semantic versioning practices in Cargo, npm, Packagist and Rubygems
 
Lost in Zero Space
Lost in Zero SpaceLost in Zero Space
Lost in Zero Space
 
Evaluating a bot detection model on git commit messages
Evaluating a bot detection model on git commit messagesEvaluating a bot detection model on git commit messages
Evaluating a bot detection model on git commit messages
 
Is my software ecosystem healthy? It depends!
Is my software ecosystem healthy? It depends!Is my software ecosystem healthy? It depends!
Is my software ecosystem healthy? It depends!
 
On the fragility of open source software packaging ecosystems
On the fragility of open source software packaging ecosystemsOn the fragility of open source software packaging ecosystems
On the fragility of open source software packaging ecosystems
 
How magic is zero? An Empirical Analysis of Initial Development Releases in S...
How magic is zero? An Empirical Analysis of Initial Development Releases in S...How magic is zero? An Empirical Analysis of Initial Development Releases in S...
How magic is zero? An Empirical Analysis of Initial Development Releases in S...
 
Comparing dependency issues across software package distributions (FOSDEM 2020)
Comparing dependency issues across software package distributions (FOSDEM 2020)Comparing dependency issues across software package distributions (FOSDEM 2020)
Comparing dependency issues across software package distributions (FOSDEM 2020)
 
Measuring Technical Lag in Software Deployments (CHAOSScon 2020)
Measuring Technical Lag in Software Deployments (CHAOSScon 2020)Measuring Technical Lag in Software Deployments (CHAOSScon 2020)
Measuring Technical Lag in Software Deployments (CHAOSScon 2020)
 
SecoHealth 2019 Research Achievements
SecoHealth 2019 Research AchievementsSecoHealth 2019 Research Achievements
SecoHealth 2019 Research Achievements
 
SECO-Assist 2019 research seminar
SECO-Assist 2019 research seminarSECO-Assist 2019 research seminar
SECO-Assist 2019 research seminar
 
Empirically Analysing the Socio-Technical Health of Software Package Managers
Empirically Analysing the Socio-Technical Health of Software Package ManagersEmpirically Analysing the Socio-Technical Health of Software Package Managers
Empirically Analysing the Socio-Technical Health of Software Package Managers
 
ConPan: Analysing Packages Installed in Docker Containers
ConPan: Analysing Packages Installed in Docker ContainersConPan: Analysing Packages Installed in Docker Containers
ConPan: Analysing Packages Installed in Docker Containers
 

Dernier

Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 

Dernier (20)

Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 

Bot or not? Detecting bots in GitHub pull request activity based on comment similarity

  • 1. Bot or Not? Detecting bots in GitHub pull request activity based on comment similarity -- Mehdi Golzadeh -- Damien Legay -- Alexandre Decan -- Tom Mens --------------------------------------- {firstname.lastname}@umons.ac.be Software engineering lab - Univerisity of Mons 2nd International Workshop on Bots in Software Engineering May 24th, 2020, Seoul, South Korea In conjunction with ICSE 2020
  • 2. > Introduction > Online distributed social coding platforms such as GitHub -- Commit changes -- Submit pull requests -- Code reviews > Projects are increasingly relying on bots -- Carry out routine tasks -- Interact with project members > Bots have been shown to improve collaborative development -- Improving software quality by automated refactoring {Wyrich2019} -- Generating patches for bugs {Monperrus2019} -- Continuous integration {Ablett2007} -- Automatically closing abandoned issues and pull requests {Wessel2019} 1/13
  • 3. > What is a pull request? -- A method of code contribution -- Participate in a project without having direct commit access > Pull-based development 2/13
  • 8. > Problems > Empirical studies about human behaviour -- Distinguish human behaviour from automated behaviour {Liu2019} -- Identity merging {GoeminneComparison2012} -- Contributor onboarding {Casalnuovo2015} -- Contributor abandonment {Constantinou2017ISSE} -- Team productivity {Meyer2014,Vasilescu2015} -- Team collaboration {Tamburri2019} Proportion of accepted pull requests with and without presence of bots in commenting activity 3/13
  • 9. > Dataset > Current methods to identify bots on GitHub -- A manual and effort-prone process -- Not easily reproducible -- Presence of the string "bot" in the user identity > Create ground truth -- Accounts that are currently active in cargo ecosystem -- Two distinct authors of this paper manually and independently verified all identities ----------------------------------------------- | | Humans | | Total | ----------------------------------------------- | GitHub identities | 250 | 42 | 292 | ----------------------------------------------- | PR comments | 16,430 | 3,360 | 20,090 | ----------------------------------------------- | GitHub repositories | 692 | 694 | 1,262 | ----------------------------------------------- 4/13
  • 10. > Underlying Idea > An automated approach for detecting bots in GitHub -- Based on the similarity of comments 5/13
  • 11. > Approach > Measure the similarity between comments > Levenshtein edit distance -- Works more on structure -- Counting the number of single-character edits > Jaccard distance -- Similarity of content -- Compares words in two strings -- We computed the Levenshtein and Jaccard distance of all pairs of PR messages C1 C2 C3 ------------------------- C1 | 0 | | | ------------------------- C2 | 0.42 | 0 | | ------------------------- C3 | 0.63 | 0.2 | 0 | ------------------------- 6/13
  • 12. > Results -- humans have higher mean distances than bots -- very similar results but not redundant -- some bots can be detected by one and not with another 7/13
  • 13. > Idea behind clustering > Pairwise similarity does not work for some bots -- different message patterns 8/13 c c c c c c c c
  • 14. > PR comment clustering > Clusters of comments -- Messages follow several distinct patterns -- Human comments spread across many small clusters -- Bots have a small number of large clusters > DBSCAN -- Specifying the number of expected clusters -- Generate clusters of unequal size -- Allows clusters to contain a single item -- Combined Levenshtein-Jaccard distance 9/13
  • 15. > Approach and results > Clusters of comments -- Bots have a lower number of clusters regardless of the number of messages -- All 42 bots have 10 clusters or less -- 249 out of the 250 humans have at least 14 clusters -- Classifying bots based on a threshold of 10 clusters -- Recall of 100% and accuracy of 97.7% 10/13
  • 16. > Limitations and Future work > Limitations -- Only 292 GitHub identities from Cargo -- Clusters of comments is the only decision factor > Future work -- Already manually verified 5,000 accounts from 37 ecosystems -- Creating a classification model instead of a threshold for clusters -- Number of comments, empty comments and Gini value as distinguishing metrics -- Developing a tool 11/13
  • 17. > Outcome > Pull-based model and bots -- Presence of bots in social coding platforms -- Problem raised by presence of bots > Our approach and the results -- Based on comment similarity and clusters -- Clusters of combination of the Jaccard and Levenshtein distance -- Our approach classifies accounts with high precision and recall > The future work -- Create a bigger dataset -- Consider more features to make a better model to classify -- Create a tool to make it useful for the community 12/13
  • 18. > Ctrl + Z > Thanks -- Any question? 13/13joint FNRS / FWO Excellence of Science project SECO-ASSIST SECO-ASSIST https://secoassist.github.io