SlideShare a Scribd company logo
1 of 35
Download to read offline
Code Mining Comment extraire et exploiter
l'information contenue dans du
code source ?
Juliette Tisseyre
Juliette Tisseyre
Software Engineer in Code Mining
Lead R&D
Margo
juliette.tisseyre@margo-group.com
www.margo-group.com
@zanoellia #CodeMining
Everybody needs to access the knowledge
to learn, explain, control, decide, monetise…
But the knowledge is not only described in natural languages.
You can extract the knowledge from a less conventional text:
the source code
| 32018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
Everybody needs to access the knowledge
to learn, explain, control, decide, monetise…
But the knowledge is not only described in natural languages.
You can extract the knowledge from a less conventional text:
the source code
3,000 billions of running lines of code in the world
| 42018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
DEVELOPERS ARE NOT EQUIPPED ENOUGH
| 52018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
WELCOME TO CODE MINING FIELD!
Text mining ➙ have a machine understand text
Code mining ➙ have a human being understand code
CodeMining: extraction of knowledge from source code
| 62018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
AGENDA
1
Code as DATA
Anatomy of a non conventional
kind of document
Technical approaches
3 global approaches:
formal, natural and hybrid
2
Applications
Let’s start to play for real
3
| 72018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
1
Code as DATA Technical approaches
2
Applications
3
Anatomy of a non conventional
kind of document
STRUCTURE OF CODE SOURCE
Document Document
Chapter Class
Section
Method
Paragraph
Bloc
Sentence
Instruction
Word
(Key)word
| 92018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
CODE SOURCE’S DUALITY
View as a
programmer
View as a
machine
| 102018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
DATA CLEANING
What do we need to keep?
JAVA
PYTHON
Same business logic “open a file” but nothing alike
| 112018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
DATA CLEANING
What is relevant or not in the source code?
➔ Generated code
➔ Technical frameworks
➔ Comments and names
➔ Useful code vs meaningless code
| 122018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
DATA CLEANING
What is relevant or not in the source code?
➔ Generated code
➔ Technical frameworks
➔ Comments and names
➔ Useful code vs meaningless code
Not a trivial task, depends on the objective
● Balance between cleaning and information loss
● Code structure and coding conventions can help to make a choice
| 132018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
1
Code as DATA Technical approaches
2
Applications
3
3 global approaches:
formal, natural and hybrid
GLOBAL PROCESS
Before applying smart algorithms, the source code
must be transformed into a model
● Business logic extraction, classification
● Automated migration / translation
● Search and indexing
● Detection of (anti) pattern or similarity
● Summary, algorithm visualisation
code
Reverse
engine model
| 152018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
FORMAL APPROACH
| 162018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
● Treat code as a structure, no interest in
naming and comments
● Based on programming language
grammars:
● Set of well defined and unambiguous
lexical, syntactic and semantic rules
● Modelisation as AST or graph
FORMAL APPROACH
Good starting point:
● Only few ambiguities thanks to internal relationship knowledge
● Existing tools and algorithms for graph analysis
But tough limitations:
● Unable to understand the meaning
● Poor results on business logic extraction
| 172018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
NATURAL APPROACH
● Treat code as simple text
● Extract natural language elements
● Name of code entities (variables, functions…)
● Comments, string content.
● Reuse of Text Mining techniques
| 182018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
NATURAL APPROACH
Datasets
● Construction of training
datasets is tricky
● Human subjectivity
● Open source vs
corporate code
● Volume, access
Similar challenges as
Text Mining
Powerful level of analysis: enable business logic extraction but...
● Unable to solve all ambiguities
● Infinite vocabulary
● Strong noise
● Not always understandable
for a human
● Mix of languages can occur
Various results
● Very poor results for
code transformation
● Dependant on the
code’s quality
| 192018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
HYBRID APPROACH
What if we take the best of the two worlds?
transformed into
Bottom up process:
● Rely on the code structure
● Text mining techniques to consolidate meaning
| 202018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
21
Hybrid approach: early stage... ...amazing lands yet to be explored!
1
Code as DATA Technical approaches
2
Applications
3
Let’s start to play for real
APPLICATION EXAMPLE 1
| 232018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
code2vec: Learning Distributed
Representations of Code
https://code2vec.org/
CODE 2 VEC
Learning Distributed Representations of Code
“A neural model for representing snippets of code as
continuous distributed vectors (code embeddings)”
| 242018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
3 methods with similar syntactic but different business logic
Impossibility to predict without the hybrid approach
CODE 2 VEC
Learning Distributed Representations of Code
| 252018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
CODE 2 VEC
Learning Distributed Representations of Code
| 262018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
APPLICATION EXAMPLE 2
| 272018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
Refactoring toolkit
by MARGO
WHAT IS REFACTORING?
Refactoring is the common process of evolving software design
while preserving the behavior
Refactoring is deeply integrated within dev processes
20%+ of development time is dedicated to refactoring
Time is is wasted through refactoring
Risky 1/3 of the manually performed refactorings insert defects
Same tedious refactorings all and all over again
| 282018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
REFACTORING TOOL, BY MARGO
MARGO’s refactoring tool relies on Code Mining techniques to perform
repetitive and monotonous refactoring tasks on behalf of the programmer.
Next level of automation
for the developper
| 292018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
HOW AUTOMATION THROUGH MARGO’S TOOL WORKS
code
Reverse
engine model
Generator
engine code
Generation into the same language
… or into another language
Refactoring
engine
Java, C#,
C/C++ …
Workflow of our engines
| 302018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
Code Smells
Catalogue
model modelDetector Remediator
Refactorings
Catalogue
ApplicatorList of code
smells
List of
refactoring
(plan)
HOW AUTOMATION THROUGH MARGO’S TOOL WORKS
Workflow of our Refactoring engine
| 312018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
Code Smells
Catalogue
model modelDetector Remediator
Refactorings
Catalogue
ApplicatorList of code
smells
List of
refactoring
(plan)
Refactoring engine
HOW AUTOMATION THROUGH MARGO’S TOOL WORKS
Workflow of our Refactoring engine
| 322018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
java-design-patterns project
● Examples of design pattern implementations
● Open source: https://github.com/iluwatar/java-design-patterns
● 14 k LoC Java
● ~ 600 classes
Refactoring has been applied on one smell:
● Hard Coded String
RESULTS EXAMPLE
Refactoring of java-design-patterns
Manual42 min
IDE20 min
Margo5
min
8 hours
5 hours
5 min
600 classes1 class
| 332018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
NEXT STEPS
Welcome in the era of the
augmented developper
| 342018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
Code mining : comment extraire et exploiter l’information détenue dans du code source ? Juliette Tisseyre - Data Jobs 2018

More Related Content

Similar to Code mining : comment extraire et exploiter l’information détenue dans du code source ? Juliette Tisseyre - Data Jobs 2018

The future of FinTech product using pervasive Machine Learning automation - A...
The future of FinTech product using pervasive Machine Learning automation - A...The future of FinTech product using pervasive Machine Learning automation - A...
The future of FinTech product using pervasive Machine Learning automation - A...Shift Conference
 
Engaging Tech Students Past Traditional Hands On
Engaging Tech Students Past Traditional Hands OnEngaging Tech Students Past Traditional Hands On
Engaging Tech Students Past Traditional Hands OnFiras Obeid
 
Cheryl Wiebe - Advanced Analytics in the Industrial World
Cheryl Wiebe - Advanced Analytics in the Industrial WorldCheryl Wiebe - Advanced Analytics in the Industrial World
Cheryl Wiebe - Advanced Analytics in the Industrial WorldRehgan Avon
 
ITCamp 2018 - Florin Coros - ‘Cloud Ready’ Design through Application Softwar...
ITCamp 2018 - Florin Coros - ‘Cloud Ready’ Design through Application Softwar...ITCamp 2018 - Florin Coros - ‘Cloud Ready’ Design through Application Softwar...
ITCamp 2018 - Florin Coros - ‘Cloud Ready’ Design through Application Softwar...ITCamp
 
‘Cloud Ready’ Design through Application Software Infrastructure
‘Cloud Ready’ Design through Application Software Infrastructure‘Cloud Ready’ Design through Application Software Infrastructure
‘Cloud Ready’ Design through Application Software InfrastructureFlorin Coros
 
GitLab: One Tool for Software Development (2018-02-06 @ SEIUM, Braga, Portugal)
GitLab: One Tool for Software Development (2018-02-06 @ SEIUM, Braga, Portugal)GitLab: One Tool for Software Development (2018-02-06 @ SEIUM, Braga, Portugal)
GitLab: One Tool for Software Development (2018-02-06 @ SEIUM, Braga, Portugal)Pedro Moreira da Silva
 
[AIIM18] Automation and Integration (Not Rip and Replace) - Stephen Ludlow
[AIIM18] Automation and Integration (Not Rip and Replace) - Stephen Ludlow[AIIM18] Automation and Integration (Not Rip and Replace) - Stephen Ludlow
[AIIM18] Automation and Integration (Not Rip and Replace) - Stephen LudlowAIIM International
 
Refocus on the agile developer
Refocus on the agile developerRefocus on the agile developer
Refocus on the agile developerSandor Dargo
 
CompTIA powered Cybersecurity Apprenticeships
CompTIA powered Cybersecurity ApprenticeshipsCompTIA powered Cybersecurity Apprenticeships
CompTIA powered Cybersecurity ApprenticeshipsZeshan Sattar
 
Open Source AI - News and examples
Open Source AI - News and examplesOpen Source AI - News and examples
Open Source AI - News and examplesLuciano Resende
 
Inventurist fast track adoption of ai innovations shared.pptx
Inventurist fast track adoption of ai innovations shared.pptxInventurist fast track adoption of ai innovations shared.pptx
Inventurist fast track adoption of ai innovations shared.pptxCirrus Shakeri
 
The City of Paris and Open Source Software, Paris Open Source Summit 2017
The City of Paris and Open Source Software, Paris Open Source Summit 2017The City of Paris and Open Source Software, Paris Open Source Summit 2017
The City of Paris and Open Source Software, Paris Open Source Summit 2017OW2
 
Get Into Open Source
Get Into Open SourceGet Into Open Source
Get Into Open SourceJoe Sepi
 
OneBot: A Comprehensive Case Study on Enterprise Digital Assistants
OneBot: A Comprehensive Case Study on Enterprise Digital AssistantsOneBot: A Comprehensive Case Study on Enterprise Digital Assistants
OneBot: A Comprehensive Case Study on Enterprise Digital AssistantsSoham Dasgupta
 
The Role of GitOps in IT-Strategy v2 - July 2022 - Schlomo Schapiro
The Role of GitOps in IT-Strategy v2 - July 2022 - Schlomo SchapiroThe Role of GitOps in IT-Strategy v2 - July 2022 - Schlomo Schapiro
The Role of GitOps in IT-Strategy v2 - July 2022 - Schlomo SchapiroSchlomo Schapiro
 
stackconf 2022: The Role of GitOps in IT Strategy
stackconf 2022: The Role of GitOps in IT Strategystackconf 2022: The Role of GitOps in IT Strategy
stackconf 2022: The Role of GitOps in IT StrategyNETWAYS
 
UXDX Berlin - Test & Deploy, by Quentin Berder, President, WiredCraft
UXDX Berlin - Test & Deploy, by Quentin Berder, President, WiredCraftUXDX Berlin - Test & Deploy, by Quentin Berder, President, WiredCraft
UXDX Berlin - Test & Deploy, by Quentin Berder, President, WiredCraftUXDXConf
 
ITCamp 2019 - Andy Cross - Business Outcomes from AI
ITCamp 2019 - Andy Cross - Business Outcomes from AIITCamp 2019 - Andy Cross - Business Outcomes from AI
ITCamp 2019 - Andy Cross - Business Outcomes from AIITCamp
 

Similar to Code mining : comment extraire et exploiter l’information détenue dans du code source ? Juliette Tisseyre - Data Jobs 2018 (20)

IoT solutions world congress 2018 review - Robbrecht van Amerongen - Conclusi...
IoT solutions world congress 2018 review - Robbrecht van Amerongen - Conclusi...IoT solutions world congress 2018 review - Robbrecht van Amerongen - Conclusi...
IoT solutions world congress 2018 review - Robbrecht van Amerongen - Conclusi...
 
The future of FinTech product using pervasive Machine Learning automation - A...
The future of FinTech product using pervasive Machine Learning automation - A...The future of FinTech product using pervasive Machine Learning automation - A...
The future of FinTech product using pervasive Machine Learning automation - A...
 
Engaging Tech Students Past Traditional Hands On
Engaging Tech Students Past Traditional Hands OnEngaging Tech Students Past Traditional Hands On
Engaging Tech Students Past Traditional Hands On
 
Cheryl Wiebe - Advanced Analytics in the Industrial World
Cheryl Wiebe - Advanced Analytics in the Industrial WorldCheryl Wiebe - Advanced Analytics in the Industrial World
Cheryl Wiebe - Advanced Analytics in the Industrial World
 
ITCamp 2018 - Florin Coros - ‘Cloud Ready’ Design through Application Softwar...
ITCamp 2018 - Florin Coros - ‘Cloud Ready’ Design through Application Softwar...ITCamp 2018 - Florin Coros - ‘Cloud Ready’ Design through Application Softwar...
ITCamp 2018 - Florin Coros - ‘Cloud Ready’ Design through Application Softwar...
 
‘Cloud Ready’ Design through Application Software Infrastructure
‘Cloud Ready’ Design through Application Software Infrastructure‘Cloud Ready’ Design through Application Software Infrastructure
‘Cloud Ready’ Design through Application Software Infrastructure
 
Industry 4.0 … Rewind
Industry 4.0 … RewindIndustry 4.0 … Rewind
Industry 4.0 … Rewind
 
GitLab: One Tool for Software Development (2018-02-06 @ SEIUM, Braga, Portugal)
GitLab: One Tool for Software Development (2018-02-06 @ SEIUM, Braga, Portugal)GitLab: One Tool for Software Development (2018-02-06 @ SEIUM, Braga, Portugal)
GitLab: One Tool for Software Development (2018-02-06 @ SEIUM, Braga, Portugal)
 
[AIIM18] Automation and Integration (Not Rip and Replace) - Stephen Ludlow
[AIIM18] Automation and Integration (Not Rip and Replace) - Stephen Ludlow[AIIM18] Automation and Integration (Not Rip and Replace) - Stephen Ludlow
[AIIM18] Automation and Integration (Not Rip and Replace) - Stephen Ludlow
 
Refocus on the agile developer
Refocus on the agile developerRefocus on the agile developer
Refocus on the agile developer
 
CompTIA powered Cybersecurity Apprenticeships
CompTIA powered Cybersecurity ApprenticeshipsCompTIA powered Cybersecurity Apprenticeships
CompTIA powered Cybersecurity Apprenticeships
 
Open Source AI - News and examples
Open Source AI - News and examplesOpen Source AI - News and examples
Open Source AI - News and examples
 
Inventurist fast track adoption of ai innovations shared.pptx
Inventurist fast track adoption of ai innovations shared.pptxInventurist fast track adoption of ai innovations shared.pptx
Inventurist fast track adoption of ai innovations shared.pptx
 
The City of Paris and Open Source Software, Paris Open Source Summit 2017
The City of Paris and Open Source Software, Paris Open Source Summit 2017The City of Paris and Open Source Software, Paris Open Source Summit 2017
The City of Paris and Open Source Software, Paris Open Source Summit 2017
 
Get Into Open Source
Get Into Open SourceGet Into Open Source
Get Into Open Source
 
OneBot: A Comprehensive Case Study on Enterprise Digital Assistants
OneBot: A Comprehensive Case Study on Enterprise Digital AssistantsOneBot: A Comprehensive Case Study on Enterprise Digital Assistants
OneBot: A Comprehensive Case Study on Enterprise Digital Assistants
 
The Role of GitOps in IT-Strategy v2 - July 2022 - Schlomo Schapiro
The Role of GitOps in IT-Strategy v2 - July 2022 - Schlomo SchapiroThe Role of GitOps in IT-Strategy v2 - July 2022 - Schlomo Schapiro
The Role of GitOps in IT-Strategy v2 - July 2022 - Schlomo Schapiro
 
stackconf 2022: The Role of GitOps in IT Strategy
stackconf 2022: The Role of GitOps in IT Strategystackconf 2022: The Role of GitOps in IT Strategy
stackconf 2022: The Role of GitOps in IT Strategy
 
UXDX Berlin - Test & Deploy, by Quentin Berder, President, WiredCraft
UXDX Berlin - Test & Deploy, by Quentin Berder, President, WiredCraftUXDX Berlin - Test & Deploy, by Quentin Berder, President, WiredCraft
UXDX Berlin - Test & Deploy, by Quentin Berder, President, WiredCraft
 
ITCamp 2019 - Andy Cross - Business Outcomes from AI
ITCamp 2019 - Andy Cross - Business Outcomes from AIITCamp 2019 - Andy Cross - Business Outcomes from AI
ITCamp 2019 - Andy Cross - Business Outcomes from AI
 

Recently uploaded

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 

Recently uploaded (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 

Code mining : comment extraire et exploiter l’information détenue dans du code source ? Juliette Tisseyre - Data Jobs 2018

  • 1. Code Mining Comment extraire et exploiter l'information contenue dans du code source ? Juliette Tisseyre
  • 2. Juliette Tisseyre Software Engineer in Code Mining Lead R&D Margo juliette.tisseyre@margo-group.com www.margo-group.com @zanoellia #CodeMining
  • 3. Everybody needs to access the knowledge to learn, explain, control, decide, monetise… But the knowledge is not only described in natural languages. You can extract the knowledge from a less conventional text: the source code | 32018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 4. Everybody needs to access the knowledge to learn, explain, control, decide, monetise… But the knowledge is not only described in natural languages. You can extract the knowledge from a less conventional text: the source code 3,000 billions of running lines of code in the world | 42018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 5. DEVELOPERS ARE NOT EQUIPPED ENOUGH | 52018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 6. WELCOME TO CODE MINING FIELD! Text mining ➙ have a machine understand text Code mining ➙ have a human being understand code CodeMining: extraction of knowledge from source code | 62018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 7. AGENDA 1 Code as DATA Anatomy of a non conventional kind of document Technical approaches 3 global approaches: formal, natural and hybrid 2 Applications Let’s start to play for real 3 | 72018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 8. 1 Code as DATA Technical approaches 2 Applications 3 Anatomy of a non conventional kind of document
  • 9. STRUCTURE OF CODE SOURCE Document Document Chapter Class Section Method Paragraph Bloc Sentence Instruction Word (Key)word | 92018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 10. CODE SOURCE’S DUALITY View as a programmer View as a machine | 102018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 11. DATA CLEANING What do we need to keep? JAVA PYTHON Same business logic “open a file” but nothing alike | 112018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 12. DATA CLEANING What is relevant or not in the source code? ➔ Generated code ➔ Technical frameworks ➔ Comments and names ➔ Useful code vs meaningless code | 122018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 13. DATA CLEANING What is relevant or not in the source code? ➔ Generated code ➔ Technical frameworks ➔ Comments and names ➔ Useful code vs meaningless code Not a trivial task, depends on the objective ● Balance between cleaning and information loss ● Code structure and coding conventions can help to make a choice | 132018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 14. 1 Code as DATA Technical approaches 2 Applications 3 3 global approaches: formal, natural and hybrid
  • 15. GLOBAL PROCESS Before applying smart algorithms, the source code must be transformed into a model ● Business logic extraction, classification ● Automated migration / translation ● Search and indexing ● Detection of (anti) pattern or similarity ● Summary, algorithm visualisation code Reverse engine model | 152018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 16. FORMAL APPROACH | 162018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre ● Treat code as a structure, no interest in naming and comments ● Based on programming language grammars: ● Set of well defined and unambiguous lexical, syntactic and semantic rules ● Modelisation as AST or graph
  • 17. FORMAL APPROACH Good starting point: ● Only few ambiguities thanks to internal relationship knowledge ● Existing tools and algorithms for graph analysis But tough limitations: ● Unable to understand the meaning ● Poor results on business logic extraction | 172018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 18. NATURAL APPROACH ● Treat code as simple text ● Extract natural language elements ● Name of code entities (variables, functions…) ● Comments, string content. ● Reuse of Text Mining techniques | 182018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 19. NATURAL APPROACH Datasets ● Construction of training datasets is tricky ● Human subjectivity ● Open source vs corporate code ● Volume, access Similar challenges as Text Mining Powerful level of analysis: enable business logic extraction but... ● Unable to solve all ambiguities ● Infinite vocabulary ● Strong noise ● Not always understandable for a human ● Mix of languages can occur Various results ● Very poor results for code transformation ● Dependant on the code’s quality | 192018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 20. HYBRID APPROACH What if we take the best of the two worlds? transformed into Bottom up process: ● Rely on the code structure ● Text mining techniques to consolidate meaning | 202018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 21. 21 Hybrid approach: early stage... ...amazing lands yet to be explored!
  • 22. 1 Code as DATA Technical approaches 2 Applications 3 Let’s start to play for real
  • 23. APPLICATION EXAMPLE 1 | 232018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre code2vec: Learning Distributed Representations of Code https://code2vec.org/
  • 24. CODE 2 VEC Learning Distributed Representations of Code “A neural model for representing snippets of code as continuous distributed vectors (code embeddings)” | 242018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 25. 3 methods with similar syntactic but different business logic Impossibility to predict without the hybrid approach CODE 2 VEC Learning Distributed Representations of Code | 252018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 26. CODE 2 VEC Learning Distributed Representations of Code | 262018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 27. APPLICATION EXAMPLE 2 | 272018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre Refactoring toolkit by MARGO
  • 28. WHAT IS REFACTORING? Refactoring is the common process of evolving software design while preserving the behavior Refactoring is deeply integrated within dev processes 20%+ of development time is dedicated to refactoring Time is is wasted through refactoring Risky 1/3 of the manually performed refactorings insert defects Same tedious refactorings all and all over again | 282018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 29. REFACTORING TOOL, BY MARGO MARGO’s refactoring tool relies on Code Mining techniques to perform repetitive and monotonous refactoring tasks on behalf of the programmer. Next level of automation for the developper | 292018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 30. HOW AUTOMATION THROUGH MARGO’S TOOL WORKS code Reverse engine model Generator engine code Generation into the same language … or into another language Refactoring engine Java, C#, C/C++ … Workflow of our engines | 302018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 31. Code Smells Catalogue model modelDetector Remediator Refactorings Catalogue ApplicatorList of code smells List of refactoring (plan) HOW AUTOMATION THROUGH MARGO’S TOOL WORKS Workflow of our Refactoring engine | 312018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 32. Code Smells Catalogue model modelDetector Remediator Refactorings Catalogue ApplicatorList of code smells List of refactoring (plan) Refactoring engine HOW AUTOMATION THROUGH MARGO’S TOOL WORKS Workflow of our Refactoring engine | 322018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 33. java-design-patterns project ● Examples of design pattern implementations ● Open source: https://github.com/iluwatar/java-design-patterns ● 14 k LoC Java ● ~ 600 classes Refactoring has been applied on one smell: ● Hard Coded String RESULTS EXAMPLE Refactoring of java-design-patterns Manual42 min IDE20 min Margo5 min 8 hours 5 hours 5 min 600 classes1 class | 332018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
  • 34. NEXT STEPS Welcome in the era of the augmented developper | 342018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre