Learning to Generate Pseudo-code from Source Code using Statistical Machine Translation

Slides presented in IEEE/ACM ASE2015.
Learning to Generate Pseudo-code from Source Code using Statistical Machine Translation
Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura
IEEE/ACM ASE, November 13, 2015
Copyright (C) 2015 by Yusuke Oda, AHC-Lab, IS, NAIST

Summary of This Study
● This presentation summarizes the key techniques used in the Pseudogen tool [Fudaba+2015].
● Goal:
– Generate natural language sentences that describe the behavior of each statement in source code.
– We call these output sentences "pseudo-code."
● Approach:
– Use two different frameworks of statistical machine translation (SMT).
Contribution of Pseudo-code
● Pseudo-code aids code reading for programming beginners.
● Programmers can double-check their code through pseudo-code.
(Figure: assisting code reading and debugging — the pseudo-code "if x divided by 5 is 0" generated for the buggy line `if x / 5 == 0:` exposes the mistake, prompting a fix to `if x % 5 == 0:`.)
Pseudo-code in This Study
● Line-to-line Assumption
– Each statement in source code can be written as one phrase in natural language with the same meaning.
● This assumption represents a minimal relationship between programming and natural language.
– We ignore more complicated cases for now (e.g. snippets, functions, documents).
Example (Python → English, to be generated):
if x % 5 == 0: / y = 'foo' / else: / print('bar')
→ "if x is divisible by 5, assign a string 'foo' to y. if not, print a string 'bar' to the output stream."
Related Work for Sentence Generation
● Rule-based methods, e.g. [Buse+ '08], [Sridhara+ '10], [Sridhara+ '11], [Moreno+ '13]
– Can use detailed information, but require high-cost maintenance.
● Data (IR)-based methods, e.g. [Haiduc+ '10], [Eddy+ '13], [Wong+ '13], [Rodeghero+ '14]
– Can use large corpora from the real world, but sometimes suffer from search errors.
(Figure: a rule-based system translates `os.print(msg)` by looking up a rule table — os.print(·) → "print · to output stream", msg → "message" — and combining the results into "print message to output system"; an IR-based system instead searches a knowledge base for a similar example and proposes its description.)
Statistical Machine Translation (SMT)
Statistical Machine Translation (SMT)
● Key idea: combine the good parts of rule-based and data-based methods.
1. Training: extract transformation rules between the two languages from a large corpus.
2. Generating: search for an accurate combination of rules for an input.
● Merits:
1. Automated: most translation rules are obtained automatically.
2. Scalable: increasing the amount of corpus data improves translation quality.
● We used two different SMT frameworks:
1. Phrase-based machine translation (PBMT)
2. Tree-to-string machine translation (T2SMT)
(Figure: a corpus trains the translator; the translator turns a source sentence into a target sentence.)
Phrase-based Machine Translation (PBMT)
● Uses token strings to generate output.
● Example: Python `if x % 5 == 0:` → English "if x is divisible by 5"
1. Tokenize: if / x / % / 5 / == / 0 / :
2. Select phrase pairs: if → "if", x → "x", % 5 → "by 5", == 0 : → "is divisible"
3. Reorder the phrases.
4. Synthesize the target sentence.
● Pro: simple method; we only need tokenizers.
● Con: cannot capture source structures.
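The four PBMT steps above can be sketched as a toy pipeline. The phrase table and the single reordering rule below are hypothetical stand-ins for illustration; a real PBMT decoder (e.g. Moses) searches over many segmentations and orderings using model scores.

```python
# A toy sketch of the PBMT pipeline: tokenize, select phrase pairs,
# reorder, synthesize. Phrase table and reordering rule are hypothetical.
phrase_table = {
    ("if",): "if",
    ("x",): "x",
    ("%", "5"): "by 5",
    ("==", "0", ":"): "is divisible",
}

def translate(tokens):
    # Steps 1-2: greedy longest-match segmentation into known phrases.
    segments, i = [], 0
    while i < len(tokens):
        for n in range(len(tokens) - i, 0, -1):
            src = tuple(tokens[i:i + n])
            if src in phrase_table:
                segments.append(phrase_table[src])
                i += n
                break
        else:
            segments.append(tokens[i])  # pass unknown tokens through
            i += 1
    # Steps 3-4: toy reordering rule, then join into the target sentence.
    if "is divisible" in segments and "by 5" in segments:
        segments.remove("is divisible")
        segments.insert(segments.index("by 5"), "is divisible")
    return " ".join(segments)

print(translate("if x % 5 == 0 :".split()))  # → if x is divisible by 5
```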
Tree-to-string Machine Translation (T2SMT)
● Uses syntax trees to generate output.
● Example: Python `if x % 5 == 0:` → English "if x is divisible by 5"
1. Parse the input into a syntax tree (if → cmp → binop).
2. Select subtrees: rules such as the one mapping the subtree for `Y % Z == 0` to "Y is divisible by Z" match, with Y = x and Z = 5.
3. Synthesize the target sentence.
● Pro: can capture source structures.
● Con: complicated method; we need tree treatment.
Training Process of SMT Methods
(Figure: the source and target corpora are first word-aligned, giving token-level relationships; rule extraction then produces translation rules and statistics (phrase-level relationships), which are combined with features into the translation model; separately, a target language model is trained on the target corpus to evaluate the fluency of the output.)
Word Alignment
● Make the word alignment (token-level relationships) using a statistical model.
(Example: Python `if x % 5 == 0 :` aligned token-by-token to English "if x is divisible by 5".)
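Word alignment is typically learned with an EM-trained statistical model. As a concrete illustration, here is a toy IBM Model 1 aligner; this is a minimal sketch under simplifying assumptions (real pipelines use a dedicated aligner such as GIZA++), not the exact tool behind the slides.

```python
# Toy IBM Model 1: learn lexical translation probabilities t(e|f)
# from sentence pairs via EM, then read off a best alignment.
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate t(e|f) from (source_tokens, target_tokens) pairs."""
    t = defaultdict(lambda: 1.0)  # uniform start; first M-step normalizes
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for f_sent, e_sent in pairs:      # E-step: expected counts
            for e in e_sent:
                z = sum(t[(e, f)] for f in f_sent)
                for f in f_sent:
                    c = t[(e, f)] / z
                    count[(e, f)] += c
                    total[f] += c
        for (e, f), c in count.items():   # M-step: renormalize
            t[(e, f)] = c / total[f]
    return t

def best_alignment(f_sent, e_sent, t):
    """Each target word picks its most probable source word."""
    return [max(range(len(f_sent)), key=lambda i: t[(e, f_sent[i])])
            for e in e_sent]

pairs = [("if x % 5 == 0 :".split(), "if x is divisible by 5".split()),
         ("if x == 0 :".split(), "if x is 0".split())]
t = ibm_model1(pairs)
print(best_alignment(pairs[0][0], pairs[0][1], t))
```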
Rule Extraction (PBMT)
● Make the word alignment (token-level relationships) using a statistical model.
● Extract phrase pairs according to the aligned words.
● Example (Python `if x % 5 == 0 :` ↔ English "if x is divisible by 5"):
– `== 0 :` → "is divisible"
– `x % 5 ==` → "x is divisible by 5"
– `if x` → "if x"
– `% 5` → "by 5"
– `5 == 0` → "is divisible by 5"
– ... and so on
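The extraction step above can be sketched in code: a phrase pair is kept only when no word inside its target span is aligned to a source word outside its source span (the standard consistency criterion from phrase-based SMT). The alignment below is hypothetical, chosen to reproduce several of the slide's pairs; this is a minimal sketch, not the paper's implementation.

```python
# Consistent phrase-pair extraction from a word alignment.
def extract_phrases(src, tgt, alignment, max_len=5):
    """alignment: set of (i, j) pairs linking src[i] to tgt[j]."""
    phrases = []
    n = len(src)
    for i1 in range(n):
        for i2 in range(i1, min(n, i1 + max_len)):
            # Target positions aligned to the source span src[i1..i2].
            tgt_pos = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            # Consistency: no target word inside [j1, j2] may be
            # aligned to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            if j2 - j1 + 1 <= max_len:
                phrases.append((" ".join(src[i1:i2 + 1]),
                                " ".join(tgt[j1:j2 + 1])))
    return phrases

src = "if x % 5 == 0 :".split()
tgt = "if x is divisible by 5".split()
# Hypothetical alignment: if-if, x-x, %-by, 5-5, ==-{is, divisible};
# '0' and ':' are left unaligned for simplicity.
alignment = {(0, 0), (1, 1), (2, 4), (3, 5), (4, 2), (4, 3)}
phrases = extract_phrases(src, tgt, alignment)
print(phrases)
```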
Rule Extraction (T2SMT)
● Given word alignments, tree-to-string rules are extracted according to the aligned words and the source parse tree.
● Example: from the parse tree of `x % 5 == 0` aligned to "x is divisible by 5", the generalized rule mapping the subtree cmp(binop(X, %, Y), ==, 0) to "X is divisible by Y" is extracted, with X = x and Y = 5.
SMT for Pseudo-code Generation
Requirements for SMT Methods
● PBMT:
– Tokenizer for natural language: use NLP tools (English: Stanford Tokenizer; Japanese: MeCab).
– Tokenizer for the programming language: use the tokenizer provided by the programming language itself.
● T2SMT:
– Tokenizer for natural language: same as for PBMT.
– Parser for the programming language: it should generate parse trees that include all tokens as leaf nodes, to be used for word alignment.
– But most programming languages provide only an AST parser.
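For Python, the "tokenizer provided by the programming language itself" is the standard-library `tokenize` module; a minimal sketch of tokenizing one statement:

```python
# Tokenize a single Python statement with the standard library.
import io
import tokenize

def code_tokens(line):
    """Return the surface tokens of one Python statement."""
    toks = tokenize.generate_tokens(io.StringIO(line).readline)
    # Drop layout-only tokens; keep the surface strings.
    skip = (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER,
            tokenize.INDENT, tokenize.DEDENT)
    return [t.string for t in toks if t.type not in skip]

print(code_tokens("if x % 5 == 0:"))  # → ['if', 'x', '%', '5', '==', '0', ':']
```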
Problem of AST
● Problem: mismatching of token nodes.
– The AST contains redundant nodes (e.g. Load/ctx) that carry no surface token.
– Some words in the natural language side align to inner nodes of the AST rather than to leaves.
● Our approach: apply simple transformation rules to avoid token mismatching.
(Figure: the Python AST of `if x % 5 == 0:` — If(test = Compare(left = BinOp(left = Name(id = x, ctx = Load), op = %, right = Num(n = 5)), ops[0] = ==, comparators[0] = Num(n = 0)), body = ...) — aligned against the English "if x is divisible by 5".)
Parse-like Tree (1): Head Insertion
1. Insert HEAD leaves (= the label of each node).
(Figure: every inner node of the AST — If, Compare, BinOp, Name, Num — gains a HEAD leaf carrying its own label.)
Parse-like Tree (2): Pruning
1. Insert HEAD leaves (= the label of each node).
2. Delete redundant nodes.
(Figure: token-less nodes such as Load/ctx are pruned from the tree.)
Parse-like Tree (3): Simplification
1. Insert HEAD leaves (= the label of each node).
2. Delete redundant nodes.
3. Integrate some nodes.
(Figure: wrapper nodes such as Name and Num are merged with their leaf tokens x, 5, 0.)
Parse-like Tree (4): Final Tree
● Finally, we obtain the parse-like tree below.
(Figure: If(HEAD = if, test = Compare(left = BinOp(x, %, 5), ops[0] = ==, comparators[0] = 0)), where every surface token of `if x % 5 == 0:` appears as a leaf, aligned to the English "if x is divisible by 5".)
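The pruning and simplification steps can be sketched with Python's own `ast` module. The `PRUNE` set and the collapsing rule below are simplified assumptions for illustration only; the paper's actual transformation rules are more detailed.

```python
# Sketch of turning a Python AST into a parse-like tree: keep each
# node's label (its HEAD), prune token-less context nodes, and
# collapse wrappers that hold a single surface token.
import ast

PRUNE = {"Load", "Store", "Del"}  # assumed set of redundant nodes

def to_parse_like(node):
    if isinstance(node, ast.AST):
        label = type(node).__name__
        if label in PRUNE:
            return None
        children = []
        for _, value in ast.iter_fields(node):
            vals = value if isinstance(value, list) else [value]
            for v in vals:
                sub = to_parse_like(v)
                if sub is not None:
                    children.append(sub)
        if len(children) == 1 and not isinstance(children[0], tuple):
            return children[0]        # simplification: collapse wrapper
        return (label, children)      # HEAD label kept with the node
    if node is None:
        return None
    return str(node)                  # leaf: surface token

tree = ast.parse("x % 5 == 0", mode="eval").body
print(to_parse_like(tree))
```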
Experiments
Corpus Summaries
● We gathered two corpora with different language pairs.
1. Python-to-English
– Python: extracted from the Django framework
– English: handmade by one human
– Amount: 18,805 pairs
– Usage: 17,000 for training, 1,805 for evaluation
2. Python-to-Japanese
– Python: extracted from student code for a programming exercise
– Japanese: handmade by one human
– Amount: 722 pairs
– Usage: 10-fold cross-validation (9/10 for training, 1/10 for evaluation)
Evaluated Methods

Method         | Framework      | Input data structure
PBMT           | Phrase-based   | Token strings generated from the tokenize module
Raw-T2SMT      | Tree-to-string | AST generated from the ast module
Modified-T2SMT | Tree-to-string | Parse-like tree (AST with transformation rules)
Evaluation Setting
● We examined two points: intrinsic evaluation (translation quality) and extrinsic evaluation (code understanding).
● Intrinsic: apply evaluation metrics used in machine translation studies.
– Automatic evaluation: BLEU
– Human evaluation: Acceptability
● Extrinsic: examine our generator in an actual task.
– Subjects read Python code under three conditions: no pseudo-code, generated pseudo-code, or human-written pseudo-code.
– They answer readability questions on a 0-5 scale, and we record reading time.
Results: Intrinsic Evaluation
● BLEU and Acceptability show the same tendency: Modified-T2SMT > Raw-T2SMT > PBMT.
● The Modified-T2SMT method performs best in all settings.
– 72% of test samples achieve the highest Acceptability (= grammatically correct and fluent).

Automatic evaluation: BLEU% [Papineni et al. 2002]
Generator      | English | Japanese
PBMT           | 25.71   | 51.67
Raw-T2SMT      | 49.74   | 55.66
Modified-T2SMT | 54.08   | 62.88
(Do not compare scores between English and Japanese.)

Human evaluation: Acceptability [Goto et al. 2013] (Python-Japanese)
(Figure: cumulative Acceptability on a 1-5 scale; samples reaching the highest grade: PBMT 50%, Raw-T2SMT 63%, Modified-T2SMT 72%.)
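For reference, BLEU is the geometric mean of clipped n-gram precisions times a brevity penalty. Below is a minimal single-reference, single-sentence sketch with crude smoothing; the scores in the table use corpus-level BLEU as defined by Papineni et al. 2002, so this is illustrative only.

```python
# Minimal sentence-level BLEU sketch (single reference, n = 1..4).
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        matched = sum(min(c, r[g]) for g, c in h.items())  # clipped counts
        total = max(sum(h.values()), 1)
        log_prec += math.log(max(matched, 1e-9) / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(log_prec)

print(bleu("if x is divisible by 5", "if x is divisible by 5"))  # → 1.0
```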
Results: Code Understanding
● Generated pseudo-code can improve code readability compared with no pseudo-code.
● But reading time increases.
– This comes from generation errors (oracle pseudo-code decreases reading time).

Code readability and reading time (Python-Japanese, Modified-T2SMT):
Group                    | Pseudo-code   | Readability (6-grade Likert) | Mean reading time [s]
Experienced (8 people)   | No            | 2.55                         | 41.37
                         | Generated     | 2.71                         | 46.48
                         | Human-written | 3.05                         | 35.65
Inexperienced (6 people) | No            | 1.32                         | 24.99
                         | Generated     | 1.81                         | 39.52
                         | Human-written | 2.10                         | 24.97
Conclusion / Future Works
● Summary:
– We generate natural language sentences (we call them pseudo-code) from source statements using statistical machine translation (SMT).
– For the tree-to-string (T2SMT) method, we apply transformation rules to make a parse-like tree.
● Results:
– SMT can generate acceptable sentences: 54% BLEU in English; 62% BLEU and 72% highest Acceptability in Japanese.
– Generated sentences can aid code readability.
● However, reading time is slower than with human-written pseudo-code; there is still room for improvement.
● Future works:
– Considering more complicated generation (input: snippets, functions, classes; output: multiple sentences, documents).
– Applying to more language pairs.
– Automated preprocessing.
