Imran Sarwar Bajwa, M. Abbas Choudhary [2006], "Natural Language Processing based Automated System for UML Diagrams Generation", in Saudi 18th National Conference on Computer Application, 2006, (18th NCCA) Riyadh, Kingdom of Saudi Arabia pp:171-176
Proposed Solution
Producing object-oriented models with less time and effort is a significant requirement. To resolve
such issues and provide a robust solution, a framework is required that can assist both users and
software engineers. The functionality developed in this research is domain specific, but it can be
extended in the future according to requirements. The designed system reads user requirements given
as plain text, maps them, and draws a set of UML diagrams such as the Class Diagram, Activity
Diagram, Sequence Diagram, Use Case Diagram and Component Diagram. An Integrated Development
Environment is also provided for user interaction and efficient input and output.
Object-Oriented Analysis and Design
Analysis and design of an information system concern understanding the problem and planning the
framework needed to accomplish the actual job. Typically, design is concerned with managing and
controlling complexity in a domain. A robust design method also helps to split big tasks into
manageable parts (Condamines, 2001). In software engineering, design methods provide various
notations, usually graphical ones. These notations make it possible to record and communicate
design decisions.
Object-oriented design has largely superseded traditional analysis and design techniques such as
structured design and data-driven design (Androutsopoulos, 1995). Compared to older design
paradigms, object-oriented design models every active entity in the problem domain using the
concept of objects.
Objects have:
• State (shape and condition)
• Behaviour (what they perform)
Object-oriented languages use variables to represent the state of an object and methods or
procedures to implement the behaviour of an object. For example, a ball could be an object. Its
state has different parameters such as colour, size, diameter, shape, type, etc. This object can
also have behaviour such as throw, roll, catch, hit, etc. The major task in the analysis and design
phase is to identify the valid objects and specify their states and behaviours. In conventional
methods, a system analyst performs this tough job and then maps this information into UML using a
graphical tool such as Visio or Rational Rose.
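The state and behaviour split described for the ball can be illustrated with a minimal sketch. The class below is only an illustration and is not part of the designed system; the field names, the `rolling` flag and the centimetre unit are assumptions, since the text names the parameters without fixing types or units.

```python
from dataclasses import dataclass


@dataclass
class Ball:
    """State of the ball: a few of the parameters named in the text."""
    colour: str
    size: str
    diameter_cm: float      # unit is an assumption; the text only says "diameter"
    rolling: bool = False   # hypothetical flag used to show a change of state

    # Behaviour: operations that change or use the state.
    def roll(self) -> None:
        self.rolling = True

    def catch(self) -> None:
        self.rolling = False


ball = Ball(colour="red", size="small", diameter_cm=6.5)
ball.roll()
```

Variables carry the state and methods carry the behaviour, which is the mapping the analyst otherwise has to draw by hand in a UML class diagram.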
In the context of this research, objects are automatically identified from a problem domain. The
user provides input text in English related to the business domain. After the lexical analysis of
the text, syntax analysis is performed at word level to recognize the word category
(Androutsopoulos, 1995). First, all the available lexicons are categorized into nouns, pronouns,
prepositions, adverbs, articles, conjunctions, etc. The syntactic analysis performed by the program
must then be able to isolate subjects, verbs, objects, adverbs, adjectives and various other
complements. It is a somewhat complex, multipart procedure. Consider the sentence:
“Zia is playing with the red ball.”
For this example, the following is the output.
Lexicons    Phase-I        Phase-II
Zia         Noun           Object
is          Helping-Verb   -------
playing     Verb           Method
with        Preposition    -------
the         Article        -------
red         Adjective      Attribute
ball        Noun           Object
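The step from Phase-I word classes to Phase-II model roles can be sketched as a simple lookup. This is a hedged illustration rather than the authors' implementation: the Phase-I tags are copied by hand from the example (with "red" treated as an adjective), and the names `PHASE_ONE`, `PHASE_TWO_RULES` and `phase_two` are hypothetical.

```python
# Phase-I word classes for "Zia is playing with the red ball."
# These are entered by hand here; in the system they come from syntax analysis.
PHASE_ONE = [
    ("Zia", "Noun"),
    ("is", "Helping-Verb"),
    ("playing", "Verb"),
    ("with", "Preposition"),
    ("the", "Article"),
    ("red", "Adjective"),
    ("ball", "Noun"),
]

# Phase-II rule: only these word classes become model elements.
PHASE_TWO_RULES = {"Noun": "Object", "Verb": "Method", "Adjective": "Attribute"}


def phase_two(tagged):
    """Map each (word, word_class) pair to its model role, or None if unused."""
    return [(word, PHASE_TWO_RULES.get(tag)) for word, tag in tagged]


for word, role in phase_two(PHASE_ONE):
    print(f"{word:8} {role or '-------'}")
```

Word classes without a rule (helping verbs, prepositions, articles) map to `None`, which corresponds to the dashed cells in the Phase-II column.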
This is the final output of the lexical assessment phase: all nouns are marked as objects, all verbs
are marked as methods, and all adjectives are marked as states of the particular object. In the
above example, ‘Zia’ and ‘ball’ are objects, ‘playing’ is the concerned method of the object Zia,
and ‘red’ is a state of the object ball.
Natural Language Processing
The understanding and multi-aspect processing of natural languages, also termed “speech
languages”, is one of the topics of greatest interest in the field of artificial intelligence
(Strzalowski, 1995). Natural languages are irregular and asymmetrical. Traditionally, natural
languages are based on informal grammars. Geographical, psychological and sociological factors
influence the behaviour of natural languages (Losee, 1996). Their set of words is not fixed, and it
changes from area to area and from time to time. Due to these variations and inconsistencies,
natural languages have different flavours; English, for example, has more than half a dozen
renowned flavours all over the world. These flavours have different accents, vocabularies and
phonological aspects. These discrepancies and inconsistencies make natural languages difficult to
process compared to formal languages (Krovetz, 1992).
In the process of analyzing and understanding natural languages, researchers usually face various
problems. The problems connected to the greater complexity of natural language include verb
conjugation, inflexion, lexical amplitude, ambiguity, etc. Of these, ambiguity causes the most
difficulty. Ambiguity can be addressed at the syntactic and semantic levels by using a sound and
robust rule-based system.
Used Methodology
Conventional natural language processing systems are rule based. Agents are another way to develop
speech language based systems (Krovetz, 1992). In this research, a rule-based algorithm has been
designed and used that can read, understand and extract the desired information. First, the basic
elements of the language grammar, such as verbs, nouns and adjectives, are extracted (Drouin,
2004); further processing is then performed on the basis of this extracted information. In
linguistic terms, verbs often specify actions, and noun phrases the objects that participate in the
action (Zelle, 1993). Each noun phrase's role then specifies how the object participates in the
action. In the following example, Ali is the agent:
“Ali is writing a letter with a pen.”
A procedure that understands such a sentence must discover that Ali is the agent because he performs
the action of writing, that the letter is the thematic object because it is the object that is
written, and that the pen is an instrument because it is the tool with which the writing is done
(Gómez-Pérez, 2004). Thus, complete sentence analysis finds information about the agent, co-agent,
thematic object, beneficiary, etc. Identifying this information helps to understand the meaning of
the input sentence. The roles are described below.
Agent: The agent causes the action to occur. In “Ahmed hit the ball,” Ahmed is the agent who
performs the task. In a passive sentence the agent may also appear after the verb, as in “The ball
was hit by Ahmed.”
Co-agent: If the agent works with a partner, that partner is called the co-agent. Both of them
carry out the action together, as in “Ahmed played tennis with Ali.”
Beneficiary: The beneficiary is the person for whom an action has been performed: “Ahmed brought
the balls for Ali.” In this sentence Ali is the beneficiary.
Thematic object: The thematic object is the object the sentence is really all about, typically the
object undergoing a change. Often the thematic object is the same as the syntactic direct object,
as in “Ahmed hit the ball.” Here the ball is the thematic object.
Conveyance: The conveyance is something in which or on which the agent travels: “Ahmed goes by
train.”
Trajectory: Motion from source to destination takes place over a trajectory. In contrast to the
other role possibilities, several prepositions can serve to introduce trajectory noun phrases:
“Ahmed and Ali went to London from Islamabad.”
Location: The location is where an action occurs. Several prepositions can introduce the location,
which is usually a noun phrase, as in “Ali studied in the library, at a desk, by the wall, under a
picture, near the door.”
Time: Time specifies when an action occurs. Prepositions such as at, before and after introduce
noun phrases that depict time, as in “Ahmed and Ali left before evening.”
Duration: Duration specifies how long an action takes. Prepositions such as since and for indicate
duration: “Ahmed and Ali walked for an hour.”
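The role descriptions above name the prepositions that can introduce each role, and several prepositions appear under more than one role. A minimal sketch of that lookup follows; it only uses prepositions that occur in the role descriptions and example sentences, and the names `ROLE_PREPOSITIONS` and `candidate_roles` are hypothetical. It is not the rule base of the designed system, which the paper does not list.

```python
# Prepositions taken from the role descriptions and their example sentences.
ROLE_PREPOSITIONS = {
    "agent": ["by"],                 # passive: "The ball was hit by Ahmed."
    "co-agent": ["with"],            # "Ahmed played tennis with Ali."
    "instrument": ["with"],          # "... writing a letter with a pen."
    "beneficiary": ["for"],          # "Ahmed brought the balls for Ali."
    "conveyance": ["by"],            # "Ahmed goes by train."
    "trajectory": ["to", "from"],
    "location": ["in", "at", "by", "near"],
    "time": ["at", "before", "after"],
    "duration": ["since", "for"],
}


def candidate_roles(preposition):
    """Return every role that the given preposition could introduce."""
    prep = preposition.lower()
    return [role for role, preps in ROLE_PREPOSITIONS.items() if prep in preps]


print(candidate_roles("for"))   # ['beneficiary', 'duration']
print(candidate_roles("by"))    # ['agent', 'conveyance', 'location']
```

A preposition that returns more than one role is exactly the kind of ambiguity that further syntactic or semantic rules would have to resolve, for example by checking whether the noun phrase after "for" denotes a person or a span of time.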
Architecture of Designed System
The designed UMLG system can draw UML diagrams after reading the text scenario provided by the
user. The system draws diagrams in five modules: Text input acquisition, Syntactic Analysis, Text
understanding, Knowledge extraction, and finally Generation of UML diagrams, as shown in Figure 1.
[Figure 1. Architecture of the Natural Language Processing based Automated System for UML Diagrams
Generation. Layers, from bottom to top: Text Input Acquisition from user; Lexical Analysis (token
extraction from given text); Syntax Analysis (extracting nouns, verbs, adjectives, etc); Semantic
Analysis (understanding meanings); Knowledge Extraction (objects, methods, attributes
identification); Diagram Generation (class, activity, etc diagrams).]
i. Text input acquisition
This module acquires the input text scenario. The user provides the business scenario in the form
of paragraphs of text. This module reads the input text in the form of characters and generates the
words or lexicons (Tang, 2001) by concatenating the input characters. This module is the
implementation of the lexical phase. Language-specific lexicons, tokens or symbols are generated in
this module.
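Generating lexicons by concatenating input characters can be sketched as a single pass over the text. The function below is a simplified assumption of how such a lexical phase might work, not the system's actual code; in particular, the choice of delimiters and the decision to keep punctuation as separate tokens are assumptions.

```python
def tokenize(text):
    """Build lexicons by concatenating characters until a delimiter is met."""
    tokens = []
    current = ""
    for ch in text:
        if ch.isalnum() or ch in "'-":
            current += ch            # still inside a word: keep concatenating
        else:
            if current:              # delimiter reached: emit the finished word
                tokens.append(current)
                current = ""
            if not ch.isspace():     # keep punctuation as its own token
                tokens.append(ch)
    if current:                      # flush a word that ends the input
        tokens.append(current)
    return tokens


print(tokenize("Zia is playing with the red ball."))
# ['Zia', 'is', 'playing', 'with', 'the', 'red', 'ball', '.']
```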
ii. Syntactic Analysis
This is the second module of the designed framework, and it reads the input from module one in the
form of words. These words are categorized into various classes such as verbs, helping verbs,
nouns, pronouns, adjectives, prepositions, conjunctions, etc (Fagan, 1989) on the basis of the
defined rules for categorization. The set of rules is defined on the basis of the standard English
grammatical rules, also called parts of speech conventions.
iii. Text Understanding
This module reads the input from module 1 in the form of words. The meanings of the given text are
inferred in this module using semantic rules (Malaisé, 2005). These words are categorized into
various classes such as verbs, helping verbs, nouns, pronouns, adjectives, prepositions,
conjunctions, etc.
iv. Knowledge extraction
Required data attributes are extracted in this module (Rijsbergen, 1977) according to the given
guidelines. This module extracts different objects and classes and their respective attributes on
the basis of the input provided by the preceding module. Nouns are represented as classes and
objects, and the properties associated with them are recorded as attributes.
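A minimal sketch of this extraction step is given below. The paper does not state the attachment rules, so the two used here are assumptions made for illustration: an adjective is attached as an attribute to the next noun, and a verb is attached as a method to the most recent preceding noun. The name `extract_classes` is hypothetical.

```python
def extract_classes(tagged):
    """Group tagged words into {class_name: {"attributes": [...], "methods": [...]}}."""
    classes = {}
    pending_adjectives = []   # adjectives waiting for the noun they modify
    last_noun = None          # most recent noun, assumed to be the subject of later verbs
    for word, tag in tagged:
        if tag == "Adjective":
            pending_adjectives.append(word)
        elif tag == "Noun":
            entry = classes.setdefault(word, {"attributes": [], "methods": []})
            entry["attributes"].extend(pending_adjectives)
            pending_adjectives = []
            last_noun = word
        elif tag == "Verb" and last_noun is not None:
            classes[last_noun]["methods"].append(word)
    return classes


tagged = [
    ("Zia", "Noun"), ("is", "Helping-Verb"), ("playing", "Verb"),
    ("with", "Preposition"), ("the", "Article"), ("red", "Adjective"),
    ("ball", "Noun"),
]
model = extract_classes(tagged)
```

For the example sentence this yields a class `Zia` with the method `playing` and a class `ball` with the attribute `red`, which is the information a class diagram generator would need.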
v. UML diagram generation
This is the last module, which finally uses UML symbols and draws various UML diagrams by combining
available symbols according to the information extracted by the previous module. As a separate
scenario will be provided for the various diagrams, such as class, sequence and activity diagrams,
separate functions are implemented for each respective diagram.
Accuracy Evaluation
To test the accuracy of the diagrams generated by the designed system, four parameters were decided:
objects, attributes, sequence and labeling. Each generated diagram from each category was checked.
The maximum score for each parameter was 25. Points were deducted according to the wrong
nominations and extractions. A matrix of results of generated diagrams is shown below.
Table 1. Testing results of different UML Diagrams
Diagram Type   Objects   Attributes   Sequence   Labeling   Total
Class          22        24           20         19         85%
Activity       23        21           16         20         80%
Sequence       21        24           21         22         88%
A matrix representing the UML diagram accuracy test (%) for class, activity and sequence diagrams
has been constructed. The overall accuracy for all types of UML diagrams is determined by adding the
total accuracy of all categories and calculating their average, which is about 84% in this case.
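The arithmetic behind Table 1 can be checked with a short sketch, assuming that each of the four parameters is scored out of 25 as stated in the text. Recomputed from the table, the row totals are 85%, 80% and 88%, and their mean is about 84.3%. The names below are hypothetical.

```python
MAX_SCORE = 25  # maximum score per parameter, as stated in the text

# Scores from Table 1, in the order: objects, attributes, sequence, labeling.
SCORES = {
    "Class": [22, 24, 20, 19],
    "Activity": [23, 21, 16, 20],
    "Sequence": [21, 24, 21, 22],
}


def accuracy_percent(scores):
    """Percentage of the maximum achievable score for one diagram type."""
    return 100 * sum(scores) / (MAX_SCORE * len(scores))


per_diagram = {name: accuracy_percent(s) for name, s in SCORES.items()}
overall = sum(per_diagram.values()) / len(per_diagram)

print(per_diagram)
print(round(overall, 1))
```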
[Figure 2. A graphical representation of the accuracy of generated diagrams. Bar chart with scores
from 0 to 30 on the vertical axis; series: Class, Activity, Sequence; horizontal axis: Objects,
Attributes, Sequence, Labeling.]
The graph above shows the accuracy of the various diagram types in terms of the objects, attributes,
sequence and labeling parameters.
Conclusion
This research addresses the dynamic generation of UML diagrams by reading and analyzing a scenario
provided by the user in English. The designed system can find the classes and objects and their
attributes and operations using an artificial intelligence technique, natural language processing.
UML diagrams such as Activity, Sequence, Component and Use Case diagrams would then be drawn. The
accuracy of the software is expected to be up to about 80% with the involvement of a software
engineer, provided that the pre-requisites of the software have been followed in preparing the
input scenario. The given scenario should be complete and written in simple and correct English.
Within the scope of our project, the software performs a complete analysis of the scenario to find
the classes, their attributes and operations, and it draws the corresponding diagrams.
An elegant graphical user interface has also been provided to the user for entering the input
scenario in a proper way and generating UML diagrams.
Future Work
The designed system for generating UML diagrams was started with the aim of producing software that
can read user requirements given as English text and draw selected types of UML diagrams: class
diagram, activity diagram, sequence diagram, use case diagram, component diagram and deployment
diagram. The last three of them, the use case, component and deployment diagrams, are still
untouched.
There is also some margin for improvement in the algorithms for generating the first three types:
class diagram, activity diagram and sequence diagram. The current accuracy of generating diagrams is
about 80% to 85%. The authors expect that it can be raised to as much as 95% by improving the
algorithms and adding the ability to learn.
References
Androutsopoulos, I., Ritchie, G. D., and Thanisch, P. (1995). “Natural Language Interfaces to Databases – An Introduction.” Natural Language Engineering, 1(1), pp. 29–81.
Condamines, A., and Rebeyrolle, J. (2001). “Searching for and identifying conceptual relationships via a corpus based approach to a Terminological Knowledge Base (CTKB): Method and Results.” Recent Advances in Computational Terminology, pp. 127–148.
Drouin, P. (2004). “Detection of Domain Specific Terminology Using Corpora Comparison.” Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.
Fagan, J. L. (1989). “The effectiveness of a non-syntactic approach to automatic phrase indexing for document retrieval.” Journal of the American Society for Information Science, 40(2), pp. 115–132.
Gómez-Pérez, A., Fernández-López, M., and Corcho, O. (2004). Ontological Engineering: with examples from the areas of Knowledge Management, e-Commerce and the Semantic Web. Springer.
Grosz, B. J., Appelt, D., Martin, P., and Pereira, F. (1987). “TEAM: An Experiment in the Design of Transportable Natural Language Interfaces.” Artificial Intelligence, 32, pp. 173–243.
Khoo, C., Chan, S., and Niu, Y. (2002). “The Many Facets of the Cause-Effect Relation.” The Semantics of Relationships. Kluwer Academic Press, pp. 51–70.
Krovetz, R., and Croft, W. B. (1992). “Lexical ambiguity and information retrieval.” ACM Transactions on Information Systems, 10, pp. 115–141.
Losee, R. M. (1996). “Learning syntactic rules and tags with genetic algorithms for information retrieval and filtering: An empirical basis for grammatical rules.” Information Processing and Management, 32(2), pp. 185–197.
Malaisé, V., Zweigenbaum, P., and Bachimont, B. (2005). “Mining Defining Contexts to Help Structuring Differential Ontologies.” Terminology, 11:1.
Rijsbergen, C. J. van. (1977). “A theoretical basis for use of co-occurrence data in information retrieval.” Journal of Documentation, 33(2), pp. 106–119.
Strzalowski, T. (1995). “Natural language information retrieval.” Journal of Information Processing and Management, 31(3), pp. 397–417.
Tang, L. R., and Mooney, R. J. (2001). “Using Multiple Clause Constructors in Inductive Logic Programming for Semantic Parsing.” Proceedings of the 12th European Conference on Machine Learning (ECML-2001), Freiburg, Germany, pp. 466–477.
Weiss, S., Apte, C., Johnson, D., Oles, F., Goetz, T., and Hampp, T. (1999). “Maximizing text-mining performance.” IEEE Intelligent Systems, 14, pp. 63–69.
Zelle, J. M., and Mooney, R. J. (1993). “Learning semantic grammars with constructive inductive logic programming.” Proceedings of the 11th National Conference on Artificial Intelligence (AAAI Press/MIT Press, Washington, D.C.), pp. 817–822.