Juliette Tisseyre, Lead Software Engineer in R&D chez Margo, a animé un talk intitulé « Code Mining, comment extraire et exploiter l’information contenue dans du code source ? » lors du salon Data Jobs 2018.
En effet, le code est une forme de data un peu particulière et riche d’informations. Pour autant, ces informations sont très hétéroclites tant par leur forme qu’en terme de pertinence. Quelles informations sont récupérables ? Quelles approches choisir pour quelles applications ? Quelles techniques de Machine Learning peuvent-être mises en place ? Au cours de cette présentation, Juliette vous invite à voir le code comme vous n’avez encore jamais pensé le regarder…
Code mining : comment extraire et exploiter l’information détenue dans du code source ? Juliette Tisseyre - Data Jobs 2018
1. Code Mining Comment extraire et exploiter
l'information contenue dans du
code source ?
Juliette Tisseyre
2. Juliette Tisseyre
Software Engineer in Code Mining
Lead R&D
Margo
juliette.tisseyre@margo-group.com
www.margo-group.com
@zanoellia #CodeMining
3. Everybody needs to access the knowledge
to learn, explain, control, decide, monetise…
But the knowledge is not only described in natural languages.
You can extract the knowledge from a less conventional text:
the source code
| 32018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
4. Everybody needs to access the knowledge
to learn, explain, control, decide, monetise…
But the knowledge is not only described in natural languages.
You can extract the knowledge from a less conventional text:
the source code
3,000 billions of running lines of code in the world
| 42018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
5. DEVELOPERS ARE NOT EQUIPPED ENOUGH
| 52018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
6. WELCOME TO CODE MINING FIELD!
Text mining ➙ have a machine understand text
Code mining ➙ have a human being understand code
CodeMining: extraction of knowledge from source code
| 62018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
7. AGENDA
1
Code as DATA
Anatomy of a non conventional
kind of document
Technical approaches
3 global approaches:
formal, natural and hybrid
2
Applications
Let’s start to play for real
3
| 72018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
8. 1
Code as DATA Technical approaches
2
Applications
3
Anatomy of a non conventional
kind of document
9. STRUCTURE OF CODE SOURCE
Document Document
Chapter Class
Section
Method
Paragraph
Bloc
Sentence
Instruction
Word
(Key)word
| 92018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
10. CODE SOURCE’S DUALITY
View as a
programmer
View as a
machine
| 102018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
11. DATA CLEANING
What do we need to keep?
JAVA
PYTHON
Same business logic “open a file” but nothing alike
| 112018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
12. DATA CLEANING
What is relevant or not in the source code?
➔ Generated code
➔ Technical frameworks
➔ Comments and names
➔ Useful code vs meaningless code
| 122018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
13. DATA CLEANING
What is relevant or not in the source code?
➔ Generated code
➔ Technical frameworks
➔ Comments and names
➔ Useful code vs meaningless code
Not a trivial task, depends on the objective
● Balance between cleaning and information loss
● Code structure and coding conventions can help to make a choice
| 132018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
14. 1
Code as DATA Technical approaches
2
Applications
3
3 global approaches:
formal, natural and hybrid
15. GLOBAL PROCESS
Before applying smart algorithms, the source code
must be transformed into a model
● Business logic extraction, classification
● Automated migration / translation
● Search and indexing
● Detection of (anti) pattern or similarity
● Summary, algorithm visualisation
code
Reverse
engine model
| 152018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
16. FORMAL APPROACH
| 162018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
● Treat code as a structure, no interest in
naming and comments
● Based on programming language
grammars:
● Set of well defined and unambiguous
lexical, syntactic and semantic rules
● Modelisation as AST or graph
17. FORMAL APPROACH
Good starting point:
● Only few ambiguities thanks to internal relationship knowledge
● Existing tools and algorithms for graph analysis
But tough limitations:
● Unable to understand the meaning
● Poor results on business logic extraction
| 172018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
18. NATURAL APPROACH
● Treat code as simple text
● Extract natural language elements
● Name of code entities (variables, functions…)
● Comments, string content.
● Reuse of Text Mining techniques
| 182018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
19. NATURAL APPROACH
Datasets
● Construction of training
datasets is tricky
● Human subjectivity
● Open source vs
corporate code
● Volume, access
Similar challenges as
Text Mining
Powerful level of analysis: enable business logic extraction but...
● Unable to solve all ambiguities
● Infinite vocabulary
● Strong noise
● Not always understandable
for a human
● Mix of languages can occur
Various results
● Very poor results for
code transformation
● Dependant on the
code’s quality
| 192018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
20. HYBRID APPROACH
What if we take the best of the two worlds?
transformed into
Bottom up process:
● Rely on the code structure
● Text mining techniques to consolidate meaning
| 202018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
22. 1
Code as DATA Technical approaches
2
Applications
3
Let’s start to play for real
23. APPLICATION EXAMPLE 1
| 232018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
code2vec: Learning Distributed
Representations of Code
https://code2vec.org/
24. CODE 2 VEC
Learning Distributed Representations of Code
“A neural model for representing snippets of code as
continuous distributed vectors (code embeddings)”
| 242018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
25. 3 methods with similar syntactic but different business logic
Impossibility to predict without the hybrid approach
CODE 2 VEC
Learning Distributed Representations of Code
| 252018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
26. CODE 2 VEC
Learning Distributed Representations of Code
| 262018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
27. APPLICATION EXAMPLE 2
| 272018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
Refactoring toolkit
by MARGO
28. WHAT IS REFACTORING?
Refactoring is the common process of evolving software design
while preserving the behavior
Refactoring is deeply integrated within dev processes
20%+ of development time is dedicated to refactoring
Time is is wasted through refactoring
Risky 1/3 of the manually performed refactorings insert defects
Same tedious refactorings all and all over again
| 282018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
29. REFACTORING TOOL, BY MARGO
MARGO’s refactoring tool relies on Code Mining techniques to perform
repetitive and monotonous refactoring tasks on behalf of the programmer.
Next level of automation
for the developper
| 292018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
30. HOW AUTOMATION THROUGH MARGO’S TOOL WORKS
code
Reverse
engine model
Generator
engine code
Generation into the same language
… or into another language
Refactoring
engine
Java, C#,
C/C++ …
Workflow of our engines
| 302018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
31. Code Smells
Catalogue
model modelDetector Remediator
Refactorings
Catalogue
ApplicatorList of code
smells
List of
refactoring
(plan)
HOW AUTOMATION THROUGH MARGO’S TOOL WORKS
Workflow of our Refactoring engine
| 312018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
32. Code Smells
Catalogue
model modelDetector Remediator
Refactorings
Catalogue
ApplicatorList of code
smells
List of
refactoring
(plan)
Refactoring engine
HOW AUTOMATION THROUGH MARGO’S TOOL WORKS
Workflow of our Refactoring engine
| 322018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
33. java-design-patterns project
● Examples of design pattern implementations
● Open source: https://github.com/iluwatar/java-design-patterns
● 14 k LoC Java
● ~ 600 classes
Refactoring has been applied on one smell:
● Hard Coded String
RESULTS EXAMPLE
Refactoring of java-design-patterns
Manual42 min
IDE20 min
Margo5
min
8 hours
5 hours
5 min
600 classes1 class
| 332018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre
34. NEXT STEPS
Welcome in the era of the
augmented developper
| 342018/11/22 Paris DataJob 2018 - CodeMining - Juliette Tisseyre