API Usage Pattern Extraction using Semantic Similarity
1. SEMANTIC NETWORK BASED API USAGE PATTERN EXTRACTION & LEARNING
Mohammad Masudur Rahman
mor543@mail.usask.ca
Department of Computer Science
University of Saskatchewan
2. PRESENTATION OVERVIEW
Introduction
Motivating Example
Background Concepts
Proposed Approach
Semantic Network of Source code
API Usage Pattern Extraction
Pattern Learning & Visualization
Experimental Results & Discussions
Threats to Validity
Conclusion & Future Works
3. INTRODUCTION
API (Application Programming Interface) Libraries
API Documentation, API Browser, forums
API Usage learning for developers
Existing projects using APIs
API Usage Patterns
4. WHAT IS API USAGE PATTERN?
A frequent and consistent sequence of API method
calls and field accesses
Performs a particular programming task.
Widely used in multiple projects
Widely accepted by developers community
10. RESEARCH QUESTIONS
RQ 1: Can semantic network technologies
represent the semantics of OO source code
properly?
RQ 2: Can this representation be used for API
usage pattern extraction and learning?
11. BACKGROUND CONCEPTS
API Usage Patterns
API Usage Violation & Anomalies
Semantic Web
Semantic Network of Source Code
Resource Description Framework (RDF)
RDF Statement or Triples
16. API USAGE PATTERN EXTRACTION
Flowchart: All usages of an API class -> Common sub-graph selection -> Candidate API usage patterns -> Pattern score > threshold?
Yes: Selected API usage patterns
No: Discarded
17. EXPERIMENTAL RESULTS
25 Open source Projects
3 API libraries (java.io, java.util, java.awt)
250 API classes selected
API usages found for 113 API classes
Pattern found for 76 API classes
Total 776 patterns
22. RESULTS DISCUSSION
RQ 1: Can semantic network technologies
represent the semantics of OO source code
properly?
Graph-based API usage extraction by Nguyen et al., FSE 2009: incomplete semantics for edges and attributes
Source code ontology by Wursch et al., ICSE 2010: does not represent the complete source code
The proposed approach captures expression level
syntax and semantics
Focuses on API usage patterns
23. RESULTS DISCUSSION
RQ 2: Can this representation be used for API
usage pattern extraction and learning?
Successfully extracts 776 patterns for 76 API
classes from 25 open source projects
A potential approach to be explored more for API
usage pattern exploration
Visualization of RDF network helps in learning
Source code as visual entities rather than lines
More comprehensive idea about OO source code
Applicable for complex OO relationships
Very useful for quick learning
24. THREATS TO VALIDITY
Representing complete semantics: a non-trivial
task.
Covering more expression types gives a more accurate representation
RDF pattern visualization within a limited display
Users need to be introduced to RDF conventions
25. CONCLUSION & FUTURE WORKS
Applicability of semantic web technologies for API
usage pattern extraction
Semantic representation for learning by the
developers
Real world user study
Extracted patterns for automatic code completion in
the IDE.
Extracted patterns for API violation and anomaly
detection
27. REFERENCES
[1] Semantic web diagram. URL http://www.w3.org/Talks/2002/10/16-sw/slide7-0.html.
[2] Tung Thanh Nguyen, Hoan Anh Nguyen, Nam H. Pham, Jafar M. Al-Kofahi, and Tien N. Nguyen. Graph-based mining of multiple object usage patterns. In Proc. ESEC/FSE, 2009, pages 383-392.
[3] M. Wursch, G. Ghezzi, G. Reif, and H. C. Gall. Supporting developers with natural language queries. In Proc. ICSE, 2010, pages 165-174.
[4] Tao Xie and Jian Pei. MAPO: mining API usages from open source repositories. In Proc. MSR, 2006, pages 574-57
[5] Semantic web technology. URL http://www.w3.org/2001/sw
[6] Visual learning style. URL http://www.learning-styles-online.com/style/visualspatial.
[7] Apache Jena framework. URL http://jena.apache.org/.
[8] Javaparser-java 1.5 parser and ast. URL http://code.google.com/p/javaparser/.
[9] RDF-gravity tool. URL http://semweb.salzburgresearch.at/apps/rdf-gravity/.
Editor's notes
Hello everybody, and welcome to my presentation. I am Masudur Rahman. Today I am going to present my research project, titled “Semantic network based API usage pattern extraction and learning”. I hope you will enjoy the presentation.
In my presentation I am going to cover the following topics.
An API (Application Programming Interface) is an important concept in modern OO programming languages, as it encourages reuse of existing programming resources without reinventing the wheel. But it is hard for a developer to master APIs when many complex APIs are involved and there is not enough help for learning how to use them. API documentation, forums and API browsers exist; however, they do not meet the developer's learning needs, as they contain only simple examples or troubleshooting information. One possible solution is to consult existing projects by other developers. Those projects have used the APIs and can provide practical examples of their usage. However, we cannot capture every API usage as a whole; rather, we need to extract the API usage patterns, which provide sufficient knowledge of how to use an API. Here comes the term: API usage pattern.
Now the question is: what is an API usage pattern? When a frequent and consistent sequence of API method calls and field accesses performs a particular programming task, that sequence is called an API usage pattern. It has to be widely used in different projects, and it has to be widely accepted by the developer community.
For example, this is a code example for reading content from a file. From our research, we found that it is a common and frequent sequence of API calls for reading file content in numerous projects. It contains calls involving the Scanner class, so it can be considered an API usage pattern for the Scanner API class. We are interested in extracting this type of usage pattern for different API classes from different open source projects, so that developers can learn them and use them in their work.
From our research, we found that there are two broad ways of extracting API usage patterns: frequent method sequence mining and graph-based approaches. In our research, we follow a relatively novel approach to API usage pattern extraction: we use semantic web technologies. Let us now explain the semantic web technologies.
Besides API usage patterns, the most important concept to explain is the semantic web, or semantic network. The semantic web is a new breed of World Wide Web proposed by Tim Berners-Lee. It consists of nodes and meaningful connecting edges, which are not like simple hyperlinks; rather, they carry meaningful information. Each node is called a resource, which can be a document, person, image, song or anything else that can be identified by a URI on the web. Basically, the semantic web is an efficient tool for knowledge representation and inference. For example, this is a simple semantic network representing some knowledge. From the network, we can retrieve where the person who wrote a particular software manual lives. This is easy for the semantic web but relatively hard with the current structure of the WWW. We are interested in applying this kind of structure to capture source code knowledge and use it for API usage pattern extraction and learning.
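The manual-author lookup described in this note can be sketched with plain tuples. This is only a minimal illustration and not the Semantic Web stack itself; the node names (manual, alice, Saskatoon, softwareX) and the predicate names (writtenBy, livesIn, describes) are invented for the example.

```python
# A tiny semantic network stored as (subject, predicate, object) triples.
# All node and predicate names below are hypothetical.
TRIPLES = [
    ("manual", "writtenBy", "alice"),
    ("alice", "livesIn", "Saskatoon"),
    ("manual", "describes", "softwareX"),
]


def objects_of(triples, subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def follow(triples, start, path):
    """Walk a chain of predicates from `start`; return the nodes reached."""
    frontier = [start]
    for predicate in path:
        frontier = [
            o for node in frontier for o in objects_of(triples, node, predicate)
        ]
    return frontier


# "Where does the person who wrote the manual live?"
author_home = follow(TRIPLES, "manual", ["writtenBy", "livesIn"])
```

Because each edge carries a named relationship, the question becomes a two-step walk over predicates, which is the property the approach relies on later for source code.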
Let us consider a new developer who has some knowledge of OO programming and is assigned to fix a problem that involves a complex API. He has to understand and learn the API to use it in his work. Here is an example that opens a file and reads its content. The code is simple and contains two object usages: Scanner and File. But the developer's understanding of the source code depends on his knowledge of Java syntax, and he has to grasp the concepts from a few lines of source code, which is not always easy, quick or helpful.
But if you consider this one, does it make sense? This is a semantic network version of the source code example I just showed. We people are visual beings, and we understand relationships more easily from graphics or structures than from text. It looks a bit colorful, but an average developer can understand the OO relationships between two objects without being concerned about Java syntax. For example, the developer can understand the relationship between the Scanner and File objects: here it shows that File is a parameter to the constructor of the Scanner object. Similarly, other relationships such as hasMethod, hasChild and hasConstructor can be represented in a simple graphical way. So this type of representation of source code can really help the developer in understanding and learning. The structure can also reflect the semantics and relationships among different source code entities, such as classes, methods, objects and instances, in a novel way that can be manipulated for API usage pattern extraction. Thus it serves our research goal, and we are motivated to use a semantic network of source code for the research.
In this research, we try to answer these two research questions. RQ 1: Can semantic network technologies represent the semantics of OO source code properly? RQ 2: Can this representation be used for API usage pattern extraction and learning?
Here are the background concepts, some of which we have already explained. Now we will discuss RDF, the framework used to implement the semantic web or network.
The semantic web is implemented using a framework called the Resource Description Framework (RDF). The building block of an RDF network is the RDF statement, or triple. Each statement represents a fact or a piece of knowledge about the network or system. Each triple has three components: subject, predicate and object. Subject: the entity about which the fact is stated. Predicate: an attribute of the subject. Object: the value of that attribute. For example, Scanner.new is a node whose type is constructor.
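A minimal sketch of how such statements could look for the Scanner example, using plain Python tuples rather than a real RDF library. The predicate names hasConstructor and hasMethod come from the talk; the `type` and `hasParameter` predicates and the node name Scanner.nextLine are assumptions made for illustration.

```python
from collections import namedtuple

# One RDF-style statement: subject, predicate, object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical statements about a Scanner usage.
usage = [
    Triple("Scanner", "hasConstructor", "Scanner.new"),
    Triple("Scanner.new", "type", "constructor"),
    Triple("Scanner.new", "hasParameter", "File"),  # assumed predicate name
    Triple("Scanner", "hasMethod", "Scanner.nextLine"),  # assumed node name
]


def describe(triples, subject):
    """Group everything stated about `subject` as predicate -> list of objects."""
    facts = {}
    for t in triples:
        if t.subject == subject:
            facts.setdefault(t.predicate, []).append(t.object)
    return facts


scanner_new_facts = describe(usage, "Scanner.new")
```

Here `scanner_new_facts` collects the two facts stated about the Scanner.new node: its type and its parameter.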
Now comes the proposed approach for API usage pattern extraction and learning. First, we selected 25 open source projects and a list of API classes from 3 standard Java libraries. Then we searched each Java class in the open source projects for API usages. We consider Java methods and constructors as the containers of API usages, so we extract them and parse them using the AST parser provided by Eclipse. After parsing, we used the selected expressions to build the semantic network of the source code, which can be considered an equivalent graphical representation of the source code. Then we used those usage graphs to extract the API usage patterns. That is the overview; now we describe some important steps in more depth.
Representing source code as an equivalent semantic network is obviously a challenging task. However, we figured out a way to do it. Java source code is parsed by the AST parser provided by Eclipse. We parsed each Java statement down to the expression level and obtained a number of expressions that express the semantics. However, in this research we are concerned with API usage patterns, so we mainly focused on the expressions that reflect OO semantics, such as method call expressions, field access expressions, object creation expressions and so on. We also used Jena, an advanced framework by Apache for working with RDF technologies. We thus used all of these (the expressions, selection rules, predicate list and Jena) to develop the RDF network of each API usage. Basically, we developed triples that contain the three parts (subject, predicate and object) to represent every fact about the source code semantics and knowledge. Each semantic network is formed from those triples. Thus, we obtained an equivalent semantic network for each API usage in the source code. The question is: why do we need to represent the source code in this way? The answer: analyzing and processing source code directly is not easy, so we need a structure that can be manipulated programmatically. A semantic-web-like structure provides additional strength through its reasoning power via SPARQL.
Now comes the API usage pattern extraction from the RDF API usages. Basically, we exploited the strength of the Jena framework for this purpose. From a list of usage graphs, we extracted all possible common subsets that capture the API calls, object creation, field access and all other API-related information. These subsets are isomorphic sub-graphs of each other and the possible candidates for API usage patterns. We then calculated the score of each pattern candidate based on its frequency within the same project and its frequency across multiple projects. Finally, based on some thresholds, we accepted a candidate pattern as a selected pattern for an API class.
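The select-by-score step can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the study: usage graphs are modelled as sets of triples with matching node labels (so set intersection stands in for common sub-graph detection), and the scoring formula, the sample triples and the project names are all hypothetical, since the talk does not give the actual formula.

```python
from itertools import combinations

# Hypothetical triples for usages of the Scanner class.
NEW = ("Scanner.new", "hasParameter", "File")
HAS_NEXT = ("Scanner", "hasMethod", "hasNextLine")
NEXT = ("Scanner", "hasMethod", "nextLine")
CLOSE = ("Scanner", "hasMethod", "close")

# Each usage: (project name, usage graph as a frozenset of triples).
USAGES = [
    ("projectA", frozenset({NEW, HAS_NEXT, NEXT, CLOSE})),
    ("projectA", frozenset({NEW, HAS_NEXT, NEXT})),
    ("projectB", frozenset({NEW, NEXT, CLOSE})),
]


def candidate_patterns(usages):
    """Candidate patterns: non-empty intersections of every pair of usage graphs."""
    candidates = set()
    for (_, g1), (_, g2) in combinations(usages, 2):
        common = g1 & g2
        if common:
            candidates.add(common)
    return candidates


def score(pattern, usages):
    """Assumed score: share of usages containing the pattern
    multiplied by the share of projects containing it."""
    hit_projects = [project for project, graph in usages if pattern <= graph]
    all_projects = {project for project, _ in usages}
    usage_share = len(hit_projects) / len(usages)
    project_share = len(set(hit_projects)) / len(all_projects)
    return usage_share * project_share


def select_patterns(usages, threshold):
    """Keep the candidates whose score exceeds the threshold; discard the rest."""
    return {p for p in candidate_patterns(usages) if score(p, usages) > threshold}


selected = select_patterns(USAGES, threshold=0.5)
```

With this sample data, the candidate that appears only in projectA falls below the threshold and is discarded, while the two candidates seen in both projects are kept, mirroring the Yes/No branch of the selection step.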
Now come the experimental results. We used 25 open source projects from different domains, such as Java graphics, image manipulation, networking, domain management, utilities, etc. We chose 3 standard API libraries and 250 classes from them. We detected usages of 113 classes in those projects; however, we were able to extract patterns for 76 API classes. In total, we extracted 776 distinct patterns.
From our experiment, we extracted this type of usage pattern from the API usages. For example, this is an API usage pattern for the BufferedInputStream API class. It is a sub-graph of the total API usage. This sub-graph, or sub-network, tells us how to create a BufferedInputStream object: create a File object and use it as a parameter to the constructor of FileInputStream, then use that FileInputStream object to construct the BufferedInputStream object.
While the semantic representation can be used for learning and understanding, it can also be converted into the corresponding source code skeleton. So this is the source code version of the API usage pattern, and it represents the same information as the network does. For this task, we parsed the semantic network, that is, the triples, to generate code skeletons that can help the developer in writing the actual code.
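A rough sketch of turning triples into a code skeleton, using the BufferedInputStream pattern described in the talk. The `hasParameter` predicate, the `.new` naming of constructor nodes and the generated variable names are assumptions made for illustration, and the sketch assumes the pattern contains no cyclic constructor dependencies.

```python
# Hypothetical triples for the BufferedInputStream pattern.
PATTERN = [
    ("BufferedInputStream.new", "hasParameter", "FileInputStream"),
    ("FileInputStream.new", "hasParameter", "File"),
]


def var_name(cls):
    """Derive a Java-style variable name from a class name."""
    return cls[0].lower() + cls[1:]


def skeleton(triples, target):
    """Emit Java-like declarations for `target` and for every object
    its constructor needs, dependencies first."""
    # Map each class to the parameter types of its constructor.
    params = {}
    for subject, predicate, obj in triples:
        if predicate == "hasParameter" and subject.endswith(".new"):
            cls = subject[: -len(".new")]
            params.setdefault(cls, []).append(obj)

    lines = []

    def emit(cls):
        args = params.get(cls, [])
        for arg in args:
            emit(arg)  # declare dependencies before they are used
        # Arguments unknown to the pattern stay as a placeholder comment.
        arg_list = ", ".join(var_name(a) for a in args) if args else "/* ... */"
        lines.append(f"{cls} {var_name(cls)} = new {cls}({arg_list});")

    emit(target)
    return lines


java_lines = skeleton(PATTERN, "BufferedInputStream")
```

The result is three Java-like declarations in dependency order (File, then FileInputStream, then BufferedInputStream), with a placeholder where the pattern says nothing about the File constructor's argument.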
Here is the table that shows a portion of our results. For example, JHotdraw7 is a Java project commonly used in different software maintenance activities. We found 689 Java classes and 7330 methods and constructors. 49 API classes are found in the project and 2547 patterns are used; however, 462 distinct patterns are extracted. We ran the experiments on 25 projects and found our approach quite promising for extracting API usage patterns.
We compared our results with the results of Nguyen et al., FSE 2009. We considered the top 5 projects based on their size and the API patterns found. The graph shows the average number of patterns found per project class file. Here, we can see that our approach performs better for 3 projects. We also notice a regular pattern in our results that is absent from Nguyen's approach. We checked those last 3 projects and found that they are highly popular and active in the real world. Additionally, we found that they involve more API usages than the other projects. So we can also infer that advanced and frequent API usage may be a possible cause of their popularity. Though we used a different set of projects than Nguyen et al., we found an interesting correlation between the two sets of results. Thus, it is reasonable to think that the proposed approach is a suitable candidate for API usage pattern extraction. However, we are working on making it more efficient.
Now come the answers to the research questions we stated at the very beginning. RQ 1: Can semantic network technologies represent the semantics of OO source code properly? Existing work by Nguyen et al. uses a graph-based approach, but its source code representation is not completely semantic, as the connecting edges are treated too abstractly, for example as data dependency or control dependency. Our approach decomposes those relationships and dependencies to a more granular level and, more importantly, can be used for knowledge inference. Existing work by Wursch et al. develops a source code ontology, but that is not a complete representation of the source code; rather, it contains partial information about it. So none of the existing works converts the source code into a semantic representation to this extent. Our approach can capture the semantics of OO source code more broadly than existing approaches.
RQ 2: Can this representation be used for API usage pattern extraction and learning? Yes, the semantic representation is quite helpful for API usage pattern extraction, as we have already done so. Moreover, this representation is a potential approach for developers learning an API because of its visual and description-logic nature. Basically, we try to add a novel concept for API learning and understanding. We are still working on it, and the current outcome can be considered a preliminary result of the whole idea.
While working, we faced a few challenges, which we tried to overcome. A complete semantic representation is a non-trivial task, as it involves the many expression types of a complete programming language. In this research, we tried to capture the OO features and concepts of Java, as we focused on API usage patterns. But if more expressions are considered, a more accurate representation is possible. We also found that RDF visualization within a limited display is challenging.
So, we proposed a new approach for representing OO source code as a semantic network, which is helpful for API usage pattern extraction, learning and visualization. More importantly, it captures more of the source code semantics than existing graph-based approaches. This research also leads us to further research problems, and we have some future plans. We will conduct a real-world user study to determine its real benefits. We will apply the extracted API usage patterns to code completion in the Eclipse IDE. The patterns will also be used for API usage violation or anomaly detection.
That is all for my presentation. Thanks, everybody.