A Predictive Stock Data Analysis with SVM-PCA Model .......................................................................1
Divya Joseph and Vinai George Biju
HOV-kNN: A New Algorithm to Nearest Neighbor Search in Dynamic Space.......................................... 12
Mohammad Reza Abbasifard, Hassan Naderi and Mohadese Mirjalili
A Survey on Mobile Malware: A War without End................................................................................... 23
Sonal Mohite and Prof. R. S. Sonar
An Efficient Design Tool to Detect Inconsistencies in UML Design Models............................................. 36
Mythili Thirugnanam and Sumathy Subramaniam
An Integrated Procedure for Resolving Portfolio Optimization Problems using Data Envelopment
Analysis, Ant Colony Optimization and Gene Expression Programming ................................................. 45
Chih-Ming Hsu
Emerging Technologies: LTE vs. WiMAX ................................................................................................... 66
Mohammad Arifin Rahman Khan and Md. Sadiq Iqbal
Introducing E-Maintenance 2.0 ................................................................................................................. 80
Abdessamad Mouzoune and Saoudi Taibi
Detection of Clones in Digital Images........................................................................................................ 91
Minati Mishra and Flt. Lt. Dr. M. C. Adhikary
The Significance of Genetic Algorithms in Search, Evolution, Optimization and Hybridization: A Short
Review ...................................................................................................................................................... 103
ISSN: 1694-2507 (Print)
ISSN: 1694-2108 (Online)
International Journal of Computer Science
and Business Informatics
(IJCSBI.ORG)
VOL 9, NO 1
JANUARY 2014
A Predictive Stock Data Analysis
with SVM-PCA Model
Divya Joseph
PG Scholar, Department of Computer Science and Engineering
Christ University Faculty of Engineering
Christ University, Kanmanike, Mysore Road, Bangalore - 560060
Vinai George Biju
Asst. Professor, Department of Computer Science and Engineering
Christ University Faculty of Engineering
Christ University, Kanmanike, Mysore Road, Bangalore – 560060
ABSTRACT
In this paper, the properties of Support Vector Machines (SVM) on financial
time series data are analyzed. High-dimensional stock data consist of many
features or attributes, most of which are uninformative for classification.
Detecting trends in stock market data is a difficult task because the data
exhibit complex, nonlinear, dynamic and chaotic behaviour. To improve stock
data forecasting performance, different models can be combined so that more
data patterns are captured. The performance of a model can also be improved
by using only the informative attributes for prediction; the uninformative
attributes are removed to increase the efficiency of the model. Here, the
uninformative attributes are eliminated from the stock data using a
dimensionality reduction technique, Principal Component Analysis (PCA). The
classification accuracy on the stock data is compared between SVM without
PCA, which considers all attributes, and the SVM-PCA model, which uses only
the informative attributes.
Keywords
Machine Learning, stock analysis, prediction, support vector machines, principal
component analysis.
1. INTRODUCTION
Time series analysis and prediction is an important task in all fields of
science for applications like forecasting the weather, forecasting the
electricity demand, research in medical sciences, financial forecasting,
process monitoring and process control, etc. [1][2][3]. Machine learning
techniques are widely used for solving pattern prediction problems.
Financial time series stock prediction is considered a very challenging
task for analysts, investigators and economists [4]. A vast number of
studies in the past have used artificial neural networks (ANN) and genetic
algorithms for time series data [5]. Many real-time applications use the
ANN tool for time-series modelling and forecasting [6]. Furthermore, the
researchers hybridized the artificial intelligence techniques. Kohara et al. [7]
incorporated prior knowledge to improve the performance of stock market
prediction. Tsaih et al. [8] integrated the rule-based technique and ANN to
predict the direction of the S& P 500 stock index futures on a daily basis.
Some of these studies, however, showed that ANN had some limitations in
learning the patterns because stock market data has tremendous noise and
complex dimensionality [9]. ANN often exhibits inconsistent and
unpredictable performance on noisy data [10]. Moreover, the back-propagation
(BP) neural network, the most popular neural network model, suffers from the
difficulty of selecting a large number of controlling parameters, which
include the relevant input variables, hidden layer size, learning rate, and
momentum term [11].
This paper proceeds as follows. The next section presents the concepts of
support vector machines. Section 3 describes principal component analysis.
Section 4 describes the implementation and the model used for predicting
the stock price index. Section 5 provides the results of the models.
Section 6 presents the conclusion.
2. SUPPORT VECTOR MACHINES
Support vector machines (SVMs) are very popular linear discrimination
methods that build on a simple yet powerful idea [12]. Samples are mapped
from the original input space into a high-dimensional feature space, in
which a 'best' separating hyperplane can be found. A separating hyperplane
H is best if its margin is largest [13].
The margin is defined as the largest distance between two hyperplanes
parallel to H on both sides that do not contain sample points between them
(we will see later a refinement to this definition) [12]. It follows from
the risk minimization principle (an assessment of the expected loss or
error, i.e., the misclassification of samples) that the generalization
error of the classifier is better if the margin is larger.
The separating hyperplane that keeps the closest points of the different
classes at maximum distance from it is preferred, as the two groups of
samples are then separated by the largest margin, and the classifier is
least sensitive to minor errors in the hyperplane's direction [14].
2.1 Linearly Separable Data
Consider two classes, and use the two labels -1 and +1 for them. The sample
is $\{x^t, r^t\}$, where $x^t$ is a p-dimensional real vector and $r^t$
represents the class, with $r^t = +1$ if $x^t \in C_1$ and $r^t = -1$ if
$x^t \in C_2$. We want to find $w$ and $w_0$ such that

$w^T x^t + w_0 \ge +1 \text{ for } r^t = +1$
$w^T x^t + w_0 \le -1 \text{ for } r^t = -1$

which can be rewritten as

$r^t (w^T x^t + w_0) \ge 1$    (1)

Here the instances are not only required to be on the correct side of the
hyperplane, but we also want them to be some distance away, for better
generalization. The distance from the hyperplane to the instances closest
to it on either side is called the margin, which we want to maximize for
best generalization.

The optimal separating hyperplane is the one that maximizes the margin.
The following expression represents the distance of an instance from the
hyperplane, i.e., its offset along the normal $w$:

$\dfrac{|w^T x^t + w_0|}{\|w\|}$

which, when $r^t \in \{+1, -1\}$, can be written as

$\dfrac{r^t (w^T x^t + w_0)}{\|w\|}$

We require this to be at least some value $\rho$:

$\dfrac{r^t (w^T x^t + w_0)}{\|w\|} \ge \rho, \quad \forall t$    (2)

We would like to maximize $\rho$, but there are an infinite number of
solutions obtained by scaling $w$; therefore we fix $\rho \|w\| = 1$. Thus,
to maximize the margin, $\|w\|$ is minimized:

$\min_{w, w_0} \frac{1}{2}\|w\|^2 \text{ subject to } r^t (w^T x^t + w_0) \ge 1, \ \forall t$    (3)
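As a small numerical sketch of Eq. (1) and the resulting margin (the toy
dataset and candidate hyperplane below are hand-picked assumptions, purely
for illustration), the canonical constraints and the margin $2/\|w\|$ can
be checked directly:

```python
import numpy as np

# Hypothetical 2-D linearly separable data: class +1 right, class -1 left.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
r = np.array([+1, +1, -1, -1])

# A hand-picked canonical hyperplane w^T x + w0 = 0: w = (1, 0), w0 = 0
# places the closest points of each class at distance 2 from it.
w = np.array([1.0, 0.0])
w0 = 0.0

# Eq. (1): every instance must satisfy r^t (w^T x^t + w0) >= 1.
constraints = r * (X @ w + w0)
print(constraints)               # every entry is >= 1
print(2.0 / np.linalg.norm(w))   # total margin between the two classes
```

With this choice of w, all four constraint values are at least 1 and the
margin evaluates to 2, matching the geometry described above.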
Figure 1 The geometry of the margin, bounded by the canonical hyperplanes
H1 and H2. The margin is the distance between the separating hyperplane
(g(x) = 0) and a hyperplane through the closest points (marked by a ring
around the data points); the ringed points are the support vectors.
This is a standard optimization problem, whose complexity depends on d, and
it can be solved directly to find $w$ and $w_0$. Then, on both sides of the
hyperplane, there will be instances at distance $\frac{1}{\|w\|}$ from it.
As there is such a margin along each side of the hyperplane, the total
margin is $\frac{2}{\|w\|}$.

If the problem is not linearly separable, instead of fitting a nonlinear
function, one trick is to map the problem to a new space by using nonlinear
basis functions. Generally the new space has many more dimensions than the
original space, and in such a case, a method whose complexity does not
depend on the input dimensionality is of most interest. To obtain a new
formulation, Eq. (3) is written as an unconstrained problem using Lagrange
multipliers $\alpha^t$:
$L_p = \frac{1}{2}\|w\|^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t (w^T x^t + w_0) - 1 \right]$
$\quad\; = \frac{1}{2}\|w\|^2 - \sum_t \alpha^t r^t (w^T x^t + w_0) + \sum_t \alpha^t$    (4)

This should be minimized with respect to $w$, $w_0$ and maximized with
respect to $\alpha^t \ge 0$. The saddle point gives the solution.
This is a convex quadratic optimization problem because the main term is
convex and the linear constraints are also convex. Therefore, the dual
problem is solved equivalently by making use of the Karush-Kuhn-Tucker
conditions. The dual is to maximize Lp with respect to w and w0 are 0 and
also that αt
≥ 0.
$\dfrac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_t \alpha^t r^t x^t$    (5)

$\dfrac{\partial L_p}{\partial w_0} = 0 \;\Rightarrow\; \sum_t \alpha^t r^t = 0$    (6)
Substituting Eq. (5) and Eq. (6) in Eq. (4), the following is obtained:

$L_d = \frac{1}{2} w^T w - w^T \sum_t \alpha^t r^t x^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t$
$\quad\; = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t$    (7)

which can be maximized with respect to $\alpha^t$ only, subject to the
constraints

$\sum_t \alpha^t r^t = 0, \text{ and } \alpha^t \ge 0, \ \forall t$

This can be solved using quadratic optimization methods. The size of the
dual depends on N, the sample size, and not on d, the input dimensionality.
Once the $\alpha^t$ are solved for, only a small percentage of them are
nonzero; most vanish with $\alpha^t = 0$. The set of $x^t$ whose
$\alpha^t > 0$ are the support vectors, and $w$ is then written as a
weighted sum of these training instances. These are the $x^t$ that satisfy
$r^t (w^T x^t + w_0) = 1$ and lie on the margin. This can be used to
calculate $w_0$ from any support vector as

$w_0 = r^t - w^T x^t$    (8)

For numerical stability it is advised that this be done for all support
vectors and the average taken. The discriminant thus found is called the
support vector machine (SVM) [1].
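The dual-based construction above can be sketched with scikit-learn (an
assumption on two counts: the paper itself uses Weka, not scikit-learn, and
the toy data below are hypothetical). For a linear kernel, `dual_coef_`
holds the products $\alpha^t r^t$ for the support vectors, so Eq. (5) and
the averaged form of Eq. (8) can be verified against the fitted model:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, well-separated toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 0.3, size=(20, 2)),
               rng.normal(-2.0, 0.3, size=(20, 2))])
r = np.array([+1] * 20 + [-1] * 20)

# A large C approximates the hard-margin formulation of Eq. (3).
clf = SVC(kernel="linear", C=1e6).fit(X, r)

# Eq. (5): w = sum_t alpha^t r^t x^t over the support vectors.
w = clf.dual_coef_[0] @ clf.support_vectors_

# Eq. (8), averaged over all support vectors for numerical stability.
r_sv = r[clf.support_]
w0 = np.mean(r_sv - clf.support_vectors_ @ w)

print(w, clf.coef_[0])        # the two weight vectors agree
print(w0, clf.intercept_[0])  # the two offsets agree
```

The recovered `w` matches `clf.coef_` and the averaged `w0` matches
`clf.intercept_`, confirming that the discriminant really is the weighted
sum of the support vectors.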
3. PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis (PCA) is a powerful tool for dimensionality
reduction. Its advantage is that, once the patterns in the data are
understood, the data can be compressed by reducing the number of
dimensions, with comparatively little loss of information.
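As a minimal sketch of the idea (pure NumPy, on a hypothetical random
dataset; the paper itself applies PCA through Weka), the data are centred,
the covariance matrix is eigendecomposed, and the samples are projected
onto the leading eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical dataset: 300 samples, 30 correlated attributes that are
# driven by only 5 underlying factors plus a little noise.
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 30))
X = latent @ mixing + 0.01 * rng.normal(size=(300, 30))

# Centre the data and eigendecompose the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep 5 principal components, mirroring the paper's 30 -> 5 reduction.
k = 5
X_reduced = Xc @ eigvecs[:, :k]

explained = eigvals[:k].sum() / eigvals.sum()
print(X_reduced.shape)        # (300, 5)
print(round(explained, 4))    # close to 1: little information is lost
```

Because the data were generated from 5 latent factors, the 5 retained
components capture nearly all of the variance, which is exactly the
"considerably less information loss" property described above.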
Figure 2 Diagrammatic Representation of Principal Component Analysis (PCA)
4. CASE STUDY
An investor in stocks ideally should get maximum returns on the investment
made, and for that should know which stocks will do well in the future.
This is the basic incentive for forecasting stock prices, and it requires
studying different stocks: their price history, their performance, the
reputation of the company, and so on, which makes it a broad area of study.
There exists considerable evidence that stock returns are to some extent
predictable. Most of this research is conducted using data from
well-established stock markets such as those of the US, Western Europe, and
Japan. It is, thus, of interest to study the extent of stock market
predictability using data from less well-established stock markets such as
that of India.
Analysts monitor changes in these numbers to decide their trading. As long
as past stock prices and trading volumes are not fully discounted by the
market, technical analysis has its value for forecasting. To maximize
profits from the stock market, more and more "best" forecasting techniques
are used by different traders. The research data set used in this study is
from the State Bank of India (SBI). The series spans 10th January 2012 to
18th September 2013. The first training and testing dataset consists of 30
attributes. The second training and testing dataset consists of 5
attributes selected with the dimensionality reduction technique PCA, using
the Weka tool.
Table 1 Number of instances in the case study
State Bank of India Stock Index
Total Number of Instances: 400
Training Instances: 300
Testing Instances: 100
The purpose of this study is to predict the direction of daily change of
the SBI index. Direction is a categorical variable indicating the movement
direction of the SBI index at any time t. It is coded as "0" or "1" in the
research data: "0" means that the next day's index is lower than today's
index, and "1" means that the next day's index is higher than today's
index. The stock data classification is implemented with Weka 3.7.9.
k-fold cross-validation is used for the classification. In k-fold
cross-validation, the original sample is randomly partitioned into k
subsamples. Of the k subsamples, a single subsample is retained as the
validation data for testing the model, and the remaining k - 1 subsamples
are used as
training data [15]. The cross-validation variable k is set to 10 for the
stock dataset [16]. The cross-validation process is then repeated k times
(the folds), with each of the k subsamples used exactly once as the
validation data. The k results from the folds can then be averaged (or
otherwise combined) to produce a single estimate.
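The partitioning scheme just described can be sketched in a few lines of
pure Python (the sample size matches the case study's 400 instances; the
paper itself relies on Weka's built-in cross-validation):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # random partition of the sample
    return [idx[i::k] for i in range(k)]    # k roughly equal subsamples

# 400 instances and k = 10, matching the settings above.
folds = k_fold_indices(400, 10)

for val_idx in folds:
    # All other folds form the training data for this round.
    train_idx = [i for f in folds if f is not val_idx for i in f]
    # A model would be trained on train_idx and scored on val_idx here.

# Each instance serves as validation data exactly once across the k rounds.
print(sorted(i for f in folds for i in f) == list(range(400)))  # True
```

The final check confirms the defining property of k-fold cross-validation:
the union of the validation folds covers every instance exactly once.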
Figure 3 Weka Screenshot of PCA
First, a model is trained with SVM and the results on the test data are
saved. Second, the dimensionality reduction technique PCA is applied to the
training dataset. PCA selects the attributes that carry the most
information for the stock index classification, reducing the number of
attributes for classification from 30 to 5. Only the most informative
attributes are then considered for classification. A new SVM model is
trained with the reduced attributes, the test data with reduced attributes
is provided to this model, and the result is saved. The results of both
models are compared and analysed.
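The two-model comparison can be sketched with scikit-learn on synthetic
data (an assumption on both counts: the paper uses Weka and the SBI
dataset, neither of which is reproduced here), with the same 300/100 split
and 30 -> 5 attribute reduction:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in for the stock data: 400 instances, 30 attributes.
X, y = make_classification(n_samples=400, n_features=30, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=300,
                                          random_state=0)

# Model 1: SVM trained on all 30 attributes.
acc_svm = SVC().fit(X_tr, y_tr).score(X_te, y_te)

# Model 2: PCA reduces 30 attributes to 5, then SVM is trained on them.
pca = PCA(n_components=5).fit(X_tr)
acc_svm_pca = SVC().fit(pca.transform(X_tr), y_tr).score(
    pca.transform(X_te), y_te)

print(acc_svm, acc_svm_pca)
```

Note that PCA is fitted on the training split only and then applied to the
test split, so no information from the test data leaks into the attribute
selection.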
5. EXPERIMENTAL RESULTS
5.1 Classification without using PCA
As the tables below show, 300 stock index instances were considered as
training data and 100 stock index instances as test data. On the test
data, 43% of the instances were correctly classified and 57% were
incorrectly classified.
Table 2 Number of instances for classification without using PCA
Number of Train Instances: 300
Number of Test Instances: 100
Number of Attributes: 30
Table 3 Classification accuracy without using PCA
Classification Accuracy
Correctly Classified Instances 43%
Incorrectly Classified Instances 57%
5.2 Classification with PCA
As the tables below show, 300 stock index instances were considered as
training data and 100 stock index instances as test data. On the test
data, 59% of the instances were correctly classified and 41% were
incorrectly classified.
Table 4 Number of instances for classification with PCA
Number of Train Instances: 300
Number of Test Instances: 100
Number of Attributes: 5
Table 5 Classification accuracy with PCA
Classification Accuracy
Correctly Classified Instances 59%
Incorrectly Classified Instances 41%
6. CONCLUSION
Support Vector Machines can produce accurate and robust classification
results on a sound theoretical basis, even when the input stock data are
non-monotone and not linearly separable. SVM evaluates the more relevant
information in a convenient way. Principal component analysis is an
efficient dimensionality reduction method which yields a better SVM
classification of the stock data. The SVM-PCA model analyzes the stock
data with fewer and more relevant
features. In this way a better picture of the stock data is obtained, which
in turn gives more efficient knowledge extraction from the stock indices.
The stock data are classified better with the SVM-PCA model than with SVM
alone. The SVM-PCA model also reduces the computational cost drastically.
The instances are labelled with nominal values in the current case study; a
future enhancement would be to use numerical values for labelling instead.
7. ACKNOWLEDGMENTS
We express our sincere gratitude to the Computer Science and Engineering
Department of Christ University Faculty of Engineering especially
Prof. K Balachandran for his constant motivation and support.
REFERENCES
[1] Divya Joseph and Vinai George Biju, "A Review of Classifying High
Dimensional Data to Small Subspaces", Proceedings of the International
Conference on Business Intelligence, IIM Bangalore, 2013.
[2] Claudio V. Ribeiro, Ronaldo R. Goldschmidt and Ricardo Choren, "A
Reuse-based Environment to Build Ensembles for Time Series Forecasting",
Journal of Software, Vol. 7, No. 11, pp. 2450-2459, 2012.
[3] A. Chitra and S. Uma, "An Ensemble Model of Multiple Classifiers for
Time Series Prediction", International Journal of Computer Theory and
Engineering, Vol. 2, No. 3, pp. 454-458, 2010.
[4] Sundaresh Ramnath, Steve Rock and Philip Shane, "The financial analyst
forecasting literature: A taxonomy with suggestions for further research",
International Journal of Forecasting, Vol. 24, pp. 34-75, 2008.
[5] Konstantinos Theofilatos, Spiros Likothanassis and Andreas
Karathanasopoulos, "Modeling and Trading the EUR/USD Exchange Rate Using
Machine Learning Techniques", ETASR - Engineering, Technology & Applied
Science Research, Vol. 2, No. 5, pp. 269-272, 2012.
[6] G. Peter Zhang, B. Eddy Patuwo and Michael Y. Hu, "A simulation study
of artificial neural networks for nonlinear time-series forecasting",
Computers & Operations Research, Vol. 28, No. 4, pp. 381-396, 2001.
[7] K. Kohara, T. Ishikawa, Y. Fukuhara and Y. Nakamura, "Stock price
prediction using prior knowledge and neural networks", Int. J. Intell.
Syst. Accounting Finance Manage., Vol. 6, No. 1, pp. 11-22, 1997.
[8] R. Tsaih, Y. Hsu and C. C. Lai, "Forecasting S&P 500 stock index
futures with a hybrid AI system", Decision Support Systems, Vol. 23,
No. 2, pp. 161-174, 1998.
[9] Mahesh Khadka, K. M. George and Nohpill Park, "Performance Analysis of
Hybrid Forecasting Model in Stock Market Forecasting", International
Journal of Managing Information Technology (IJMIT), Vol. 4, No. 3,
August 2012.
[10] Kyoung-jae Kim, "Artificial neural networks with evolutionary
instance selection for financial forecasting", Expert Systems with
Applications, Vol. 30, No. 3, pp. 519-526, April 2006.
[11] Guoqiang Zhang, B. Eddy Patuwo and Michael Y. Hu, "Forecasting with
artificial neural networks: The state of the art", International Journal
of Forecasting, Vol. 14, pp. 35-62, 1998.
[12] K. Kim and I. Han, "Genetic algorithms approach to feature
discretization in artificial neural networks for the prediction of stock
price index", Expert Systems with Applications, Vol. 19, No. 2,
pp. 125-132, 2000.
[13] F. Cai and V. Cherkassky, "Generalized SMO algorithm for SVM-based
multitask learning", IEEE Transactions on Neural Networks and Learning
Systems, Vol. 23, No. 6, pp. 997-1003, 2012.
[14] Corinna Cortes and Vladimir Vapnik, "Support-Vector Networks",
Machine Learning, Vol. 20, No. 3, pp. 273-297, 1995.
[15] Shivanee Pandey, Rohit Miri and S. R. Tandan, "Diagnosis and
Classification of Hypothyroid Disease Using Data Mining Techniques",
International Journal of Engineering Research & Technology, Vol. 2,
No. 6, June 2013.
[16] Hui Shen, William J. Welch and Jacqueline M. Hughes-Oliver,
"Efficient, Adaptive Cross-Validation for Tuning and Comparing Models,
with Application to Drug Discovery", The Annals of Applied Statistics,
Vol. 5, No. 4, pp. 2668-2687, 2011.
This paper may be cited as:
Joseph, D. and Biju, V. G., 2014. A Predictive Stock Data Analysis with
SVM-PCA Model. International Journal of Computer Science and Business
Informatics, Vol. 9, No. 1, pp. 1-11.
HOV-kNN: A New Algorithm to
Nearest Neighbor Search in
Dynamic Space
Mohammad Reza Abbasifard
Department of Computer Engineering,
Iran University of Science and Technology,
Tehran, Iran
Hassan Naderi
Department of Computer Engineering,
Iran University of Science and Technology,
Tehran, Iran
Mohadese Mirjalili
Department of Computer Engineering,
Iran University of Science and Technology,
Tehran, Iran
ABSTRACT
Nearest neighbor search is one of the most important problems in computer
science due to its numerous applications. Recently, researchers have
turned to the harder problem of finding nearest neighbors in a dynamic
space; in contrast to the static setting, there are not many works in this
new area. In this paper we introduce a new nearest neighbor search
algorithm (called HOV-kNN) suitable for dynamic spaces, as it eliminates
the extensive preprocessing step of static approaches. The basic idea of
our algorithm is to eliminate unnecessary computations in the Higher Order
Voronoi Diagram (HOVD) to efficiently find nearest neighbors. The proposed
algorithm can report the k nearest neighbors with time complexity
O(kn log n), in contrast to previous work which was O(k^2 n log n). To
show its accuracy, we have implemented this algorithm and evaluated it
using an automatically and randomly generated data point set.
Keywords
Nearest Neighbor search, Dynamic Space, Higher Order Voronoi Diagram.
1. INTRODUCTION
The Nearest Neighbor search (NNS) is one of the main problems in computer
science, with numerous applications such as pattern recognition, machine
learning, information retrieval and spatio-temporal databases [1-6].
Different approaches and algorithms have been proposed for these diverse
applications. In a well-known categorization, these approaches and
algorithms can be divided into static and dynamic (moving points). The
existing algorithms and approaches can be divided into three categories,
based on whether the query points and/or data objects are moving: (i)
static kNN queries for static objects, (ii) moving kNN queries for static
objects, and (iii) moving kNN queries for moving objects [15].
In the first category, data points as well as query point(s) have
stationary positions [4, 5]. Most of these approaches first index the data
points by performing a pre-processing operation that constructs a specific
data structure. Different search algorithms can usually be carried out on a
given data structure to find nearest neighbors. Unfortunately, the
pre-processing step, index construction, has a high complexity and takes
more time than the search step. This time can be reasonable when the space
is static, because once the data structure is constructed, multiple queries
can be answered with it. In other words, the time taken by the
pre-processing step is amortized over query execution time. In this case,
the search algorithm has logarithmic time complexity. Therefore, these
approaches are useful when high-velocity query execution on a large
stationary data volume is needed.
Some applications need the answer to a query as soon as the data is
accessible, and they cannot tolerate the pre-processing execution time. For
example, in a dynamic space where data points are moving, spending such
time to construct a temporary index is illogical. As a result, approaches
that work very well in a static space may be useless in a dynamic one.
In this paper a new method, called HOV-kNN, suitable for finding the k
nearest neighbors in a dynamic environment, is presented. In the k-nearest
neighbor search problem, given a set P of points in a d-dimensional
Euclidean space R^d (P ⊂ R^d) and a query point q (q ∈ R^d), the problem
is to find the k points nearest to the given query point q [2, 7]. The
proposed algorithm has a good query execution complexity, O(kn log n),
without suffering from a time-consuming pre-processing step. The approach
is based on the well-known Voronoi diagram (VD) [11]. As an innovation, we
have changed the Fortune algorithm [13] in order to create the order-k
Voronoi diagrams that are used for finding the kNN.
The organization of this paper is as follows. The next section gives an
overview of related work. Section 3 presents basic concepts and
definitions. Section 4 explains our new approach, HOV-kNN. Our
experimental results are discussed in section 5. We finish with a
conclusion and future works in section 6.
2. RELATED WORKS
Recently, many methods have been proposed for the k-nearest neighbor
search problem. A naive solution to the NNS problem is the linear search
method, which computes the distance from the query to every single point
in the dataset and returns the k closest points. This approach is
guaranteed to find the exact nearest neighbors [6]. However, it can be
expensive for massive datasets, so approximate nearest neighbor search
algorithms have been presented even for static spaces [2].
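A minimal sketch of this linear-search baseline (pure NumPy; the small
point set below is hypothetical):

```python
import numpy as np

def knn_linear_search(points, q, k):
    """Return the indices of the k points closest to query q (brute force)."""
    dists = np.linalg.norm(points - q, axis=1)  # distance to every point
    return np.argsort(dists)[:k]                # indices of the k smallest

points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]])
q = np.array([0.9, 0.9])
print(knn_linear_search(points, q, 2))  # → [1 3]
```

Each query costs O(n) distance computations (plus the sort), which is
exactly why index-based and approximate methods are preferred for massive
datasets.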
One of the main components of any NNS approach is its data structure.
Among the different data structures, various trees are the most used, and
they can be applied in both static and dynamic spaces. Listing the
proposed kNN solutions for static spaces is out of the scope of this
paper; the interested reader can refer to the more comprehensive and
detailed discussions of this subject in [4, 5]. Just to name some of the
more important structures, we can point to the kd-tree, ball-tree, R-tree,
R*-tree, B-tree and X-tree [2-5, 8, 9]. In contrast, there are a number of
papers that use graph data structures for nearest neighbor search. For
example, Hajebi et al. have performed hill-climbing on a kNN graph: they
built a nearest neighbor graph in an offline phase and performed a greedy
search on it to find the closest node to the query [6].
However, the focus of this paper is on dynamic spaces. In contrast to the
static space, finding nearest neighbors in a dynamic environment is a new
topic of research with a relatively limited number of publications. Song
and Roussopoulos have proposed the Fixed Upper Bound Algorithm, Lazy
Search Algorithm, Pre-fetching Search Algorithm and Dual Buffer Search to
find the k nearest neighbors of a moving query point in a static space
with stationary data points [8]. Güting et al. have presented a
filter-and-refine approach to the kNN search problem in a space where both
data points and query points are moving. The filter step traverses the
index and creates a stream of so-called units (linear pieces of a
trajectory) as a superset of the units required to build the query's
results. The refinement step processes an ordered stream of units and
determines the pieces of units forming the final precise result [9].
Frentzos et al. showed mechanisms to perform NN search for moving object
trajectories on structures such as the R-tree, TB-tree and 3D-R-tree.
They used depth-first and best-first algorithms in their method [10].
As mentioned, we use the Voronoi diagram [11] to find the kNN in a dynamic
space. D. T. Lee used the Voronoi diagram to find the k nearest neighbors.
He described a sequential algorithm for computing the order-k Voronoi
diagram in O(k^2 n log n) time and O(k^2 (N - k)) space [12]. Henning
Meyerhenke presented and analyzed a parallel algorithm for constructing
the HOVD for two parallel models, PRAM and CGM [14]. In these models he
used Lee's iterative approach, but his algorithm takes
O(k^2 (n - k) log n / p) running time and O(k) communication rounds on a
CGM with O(k^2 (N - k) / p) local memory per processor [14], where p is
the number of participating machines.
3. BASIC CONCEPTS AND DEFINITIONS
Let P be a set of n sites (points) in the Euclidean plane. The Voronoi
diagram informally is a subdivision of the plane into cells (Figure 1)which
each point of that has the same closest site [11].
Figure 1.Voronoi Diagram
Euclidean distance between two points p and q is denoted by 𝑑𝑖𝑠𝑡 𝑝, 𝑞 :
𝑑𝑖𝑠𝑡 𝑝, 𝑞 : = (𝑝𝑥 − 𝑞𝑥)2 + (𝑝𝑦 − 𝑞𝑦)2 (1)
Definition (Voronoi diagram):Let 𝑃 = {𝑝1, 𝑝2, … , 𝑝 𝑛 } be a set of n distinct
points (so called sites) in the plane. Voronoi diagram of P is defined as the
subdivision of the plane into n cells, one for each site in P, with the
characteristic that q in the cell corresponding to site 𝑝𝑖 if𝑑𝑖𝑠𝑡 𝑞, 𝑝𝑖 <
𝑑𝑖𝑠𝑡 𝑞, 𝑝𝑗 for each 𝑝𝑗 ∈ 𝑃 𝑤𝑖𝑡ℎ 𝑗 ≠ 𝑖 [11].
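The defining rule above can be sketched directly: a query point belongs to
the cell of whichever site minimizes the distance of Eq. (1) (pure Python;
the sites below are hypothetical):

```python
import math

def dist(p, q):
    """Euclidean distance between two planar points, Eq. (1)."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def voronoi_cell(sites, q):
    """Index of the site whose Voronoi cell contains query point q."""
    return min(range(len(sites)), key=lambda i: dist(sites[i], q))

sites = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
print(voronoi_cell(sites, (3.5, 0.5)))  # → 1 (closest to site (4, 0))
```

Of course, checking every site per query is the O(n) brute-force view; the
point of computing the diagram itself is to answer such cell-membership
queries without scanning all sites.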
Historically, O(n²) incremental algorithms for computing the VD were known for many years. Then an O(n log n) algorithm based on divide and conquer was introduced, but it was complex and difficult to understand. Steven Fortune [13] later proposed a plane-sweep algorithm, which provides a simpler O(n log n) solution to the problem.
Instead of partitioning the space into regions according to the closest sites,
one can also partition it according to the k closest sites, for some 1 ≤ k ≤ n − 1. The diagrams obtained in this way are called higher-order Voronoi diagrams (HOVD), and for a given k, the diagram is called the order-k Voronoi diagram [11]. Note that the order-1 Voronoi diagram is nothing
more than the standard VD. The order-(n−1) Voronoi diagram is the farthest-point Voronoi diagram (given a set P of points in the plane, a point of P has a cell in the farthest-point VD iff it is a vertex of the convex hull), because the Voronoi cell of a point pi is now the region of points for which pi is the farthest site. Currently the best known algorithms for computing the
order-k Voronoi diagram run in O(n log³ n + nk) time and in O(n log n + nk·2^(c log* k)) time, where c is a constant [11].
Figure 2. Farthest-Point Voronoi diagram [11]
Consider x and y, two distinct elements of P. The set of points whose nearest and second-nearest neighbors are x and y forms a cell in the second-order Voronoi diagram. The second-order Voronoi diagram can thus be used when we are interested in the two closest sites and want a diagram that captures that.
Figure 3. An instance of an HOVD [11]
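The order-k idea can be made concrete by brute force: the order-k cell containing a point is identified by the (unordered) set of its k closest sites. An illustrative sketch (coordinates invented):

```python
import math

def k_closest_sites(q, sites, k):
    # The order-k Voronoi cell containing q is labelled by the
    # (unordered) set of q's k closest sites.
    ranked = sorted(sites, key=lambda p: math.dist(p, q))
    return frozenset(ranked[:k])

sites = [(0, 0), (4, 0), (2, 3), (6, 5)]
# Second-order diagram: which pair {x, y} of sites is the
# nearest/second-nearest pair for this query point?
print(k_closest_sites((1, 1), sites, 2))
```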
4. SUGGESTED ALGORITHM
As mentioned before, one of the best algorithms for constructing the Voronoi diagram is Fortune's algorithm. Furthermore, the HOVD can be used to find the k nearest neighbors [12]. D.T. Lee used an O(k²n log n) algorithm that constructs a complete HOVD to obtain the nearest neighbors. Lee's algorithm first builds the first-order Voronoi diagram and then finds the region of the diagram that contains the query point; the site of that region is the first neighbor of the query. In the next step, this nearest point is omitted from the dataset and the process is repeated; in other words, the Voronoi diagram is rebuilt on the remaining points. The second repetition finds the second neighbor, and so on. Thus the neighbors of a given query point are found sequentially, from nearest outward.
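Lee's remove-and-rebuild scheme reduces, conceptually, to the following loop; here a brute-force nearest-site search stands in for the Voronoi construction, so this is only an illustration of the idea, not of its complexity:

```python
import math

def knn_by_removal(q, sites, k):
    # Emulate Lee's sequential scheme: repeatedly take the nearest
    # remaining site as the next neighbor, then remove it and
    # "rebuild" on the rest of the points.
    remaining = list(sites)
    neighbors = []
    for _ in range(min(k, len(remaining))):
        nearest = min(remaining, key=lambda p: math.dist(p, q))
        neighbors.append(nearest)
        remaining.remove(nearest)
    return neighbors

sites = [(0, 0), (4, 0), (2, 3), (1, 1)]
print(knn_by_removal((0.9, 1.2), sites, 2))  # [(1, 1), (0, 0)]
```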
However, we argue that the nearest neighbors can be found without completing the HOVD construction. More precisely, in Lee's algorithm, each time a nearest neighbor is omitted, the next-order Voronoi diagram is built completely (edges and vertices) before the search algorithm runs to compute the next neighbor. In contrast, our algorithm computes only the vertices of the Voronoi diagram, and the neighbors of the query are found during vertex computation. The overhead of computing edges is therefore eliminated. As we show later in this paper, removing this superfluous computation yields an algorithm that is more efficient in terms of time complexity.
We use Fortune's algorithm to create the Voronoi diagram; because of space limitations we do not describe it here, and readers may refer to [11, 13]. As the sweep line moves in Fortune's algorithm, two sets of events emerge: site events and circle events [11]. Our algorithm uses the detected circle events to find the k nearest neighbors. Some circle events encountered by the algorithm are not actual circle events; these are called false-alarm circle events. Our algorithm (see the next section) processes real circle events efficiently and does not superfluously consider false alarms. A point in the plane lies inside a circle when its distance from the circle's center is less than the radius. The vertices of a Voronoi diagram are the centers of the circumcircles of triangles formed by triples of sites. The main purpose of our algorithm is to find the circles in which the desired query is located.
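The in-circle test just described can be sketched directly: the circumcenter o of three sites is a Voronoi vertex candidate, and a query q lies inside the corresponding circle when dist(q, o) < r. A brute-force illustrative sketch (collinear triples are skipped):

```python
import math

def circumcircle(a, b, c):
    # Center and radius of the circle through sites a, b, c
    # (the center is a candidate Voronoi vertex).
    d = 2 * (a[0]*(b[1]-c[1]) + b[0]*(c[1]-a[1]) + c[0]*(a[1]-b[1]))
    if d == 0:
        return None  # collinear sites: no circumcircle
    ux = ((a[0]**2+a[1]**2)*(b[1]-c[1]) + (b[0]**2+b[1]**2)*(c[1]-a[1])
          + (c[0]**2+c[1]**2)*(a[1]-b[1])) / d
    uy = ((a[0]**2+a[1]**2)*(c[0]-b[0]) + (b[0]**2+b[1]**2)*(a[0]-c[0])
          + (c[0]**2+c[1]**2)*(b[0]-a[0])) / d
    center = (ux, uy)
    return center, math.dist(center, a)

def query_inside(q, a, b, c):
    # True when q falls inside the circle event formed by a, b, c.
    cc = circumcircle(a, b, c)
    return cc is not None and math.dist(q, cc[0]) < cc[1]

# Unit circle through (1,0), (-1,0), (0,1): center (0,0), radius 1.
print(query_inside((0.2, 0.2), (1, 0), (-1, 0), (0, 1)))  # True
```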
Since the proposed algorithm needs no pre-processing, it is well suited to dynamic environments that cannot tolerate time-consuming pre-processing overheads: as readers may know, in kNN search methods a large share of the time is dedicated to constructing a data structure (usually in the form of a tree). The algorithm can be efficient especially when there are a large number of points whose motion is considerable.
4.1 HOV-kNN algorithm
Having described our algorithm briefly above, we now elaborate it formally. While the first-order Voronoi diagram is constructed, some of the query's neighbors can already be obtained within the complexity of Fortune's algorithm (i.e., O(n log n)); this fact forms the first step of our algorithm. When the circle event discovered in HandleCircleEvent of Fortune's algorithm is real (recorded in the variable "check" in line 6 of the algorithm; by default, HandleCircleEvent returns true when the circle event is real), the query's distance from the center of the circle is measured. Moreover, when the condition in line 7.i of the algorithm is true, the three points that constitute the circle are added to the NEARS list if they have not been added
before (the function PUSH-TAG(p) indicates whether p has already been added to the NEARS list).
1) Input: q, a query point
2) Output: list NEARS, the k nearest neighbors
3) Procedure:
4) Initialization:
5) NEARS = {} -- k nearest neighbors
   Check = false, MOD = 0, V = {} -- V holds Voronoi vertices
6) Check = HandleCircleEvent()
7) If Check = true, then -- a true circle event is detected
   i) If distance(q, o) < r, then
      (1) If PUSH-TAG(p1) = false, then
          (a) add p1 to NEARS
      (2) If PUSH-TAG(p2) = false, then
          (a) add p2 to NEARS
      (3) If PUSH-TAG(p3) = false, then
          (a) add p3 to NEARS
Real circle events are discovered up to this point, and the points that constitute those events are added to the query's neighbor list. As pointed out earlier, if the input "k" is less than or equal to the number of neighbors obtained, the desired result is achieved within O(n log n) complexity.
8) If SIZE(NEARS) > k, then
   a. sort(NEARS) -- sort NEARS by distance to q
   b. for i = 1 to k
      i. print(NEARS[i])
9) Else if SIZE(NEARS) = k, then
   a. print(NEARS)
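The first step can be sketched without Fortune's sweep by enumerating site triples as stand-ins for circle events (purely illustrative; the real algorithm obtains these circles from HandleCircleEvent and therefore avoids this cubic enumeration):

```python
import math
from itertools import combinations

def circumcircle(a, b, c):
    # Circumcenter/radius of the circle through a, b, c (None if collinear).
    d = 2 * (a[0]*(b[1]-c[1]) + b[0]*(c[1]-a[1]) + c[0]*(a[1]-b[1]))
    if d == 0:
        return None
    ux = ((a[0]**2+a[1]**2)*(b[1]-c[1]) + (b[0]**2+b[1]**2)*(c[1]-a[1])
          + (c[0]**2+c[1]**2)*(a[1]-b[1])) / d
    uy = ((a[0]**2+a[1]**2)*(c[0]-b[0]) + (b[0]**2+b[1]**2)*(a[0]-c[0])
          + (c[0]**2+c[1]**2)*(b[0]-a[0])) / d
    center = (ux, uy)
    return center, math.dist(center, a)

def step1_nears(q, sites, k):
    # Lines 6-9 in spirit: whenever q lies inside a (stand-in) circle
    # event, push the three generating sites into NEARS exactly once.
    nears = []
    for a, b, c in combinations(sites, 3):
        cc = circumcircle(a, b, c)
        if cc and math.dist(q, cc[0]) < cc[1]:   # line 7.i
            for p in (a, b, c):                  # lines 7.i.(1)-(3)
                if p not in nears:               # the PUSH-TAG check
                    nears.append(p)
    nears.sort(key=lambda p: math.dist(p, q))    # line 8.a
    return nears[:k] if len(nears) >= k else nears

sites = [(0, 0), (4, 0), (2, 3), (1, 1), (5, 4)]
print(step1_nears((1.5, 1.0), sites, 3))
```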
The algorithm enters the second step if the conditions of lines 8 and 9 of the first part are not met. The second part computes the Voronoi vertices sequentially, so that the obtained vertices are HOV vertices. Under the sequential method for developing the HOV [12], the vertices of the HOV are obtained by omitting the closer neighbors. Here, however, to find more neighbors, the closest neighbor and the farthest neighbor are deleted alternately from the point set in each loop iteration. This leads to new circles that encompass the query. Afterward, the same calculations described for the first part are carried out on the remaining points (the removed neighbors are recorded in a list named REMOVED_POINTS). The calculations are carried out until the loop condition in line 13 is met.
10) Else if SIZE(NEARS) < k, then
    c. If MOD mod 2 = 0, then
       i. add nearest_Point to REMOVED_POINTS
       ii. Remove(P, nearest_Point)
    d. If MOD mod 2 = 1, then
       i. add farthest_Point to REMOVED_POINTS
       ii. Remove(P, farthest_Point)
11) Increment MOD
12) Repeat lines 6 to 9 of part 1 for the remaining points P
13) Repeat until SIZE(NEARS) + SIZE(REMOVED_POINTS) >= k
14) PRINT(NEARS)
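The alternating deletions of part 2 (lines 10-13) can be sketched as follows; the re-run of lines 6-9 on the surviving points is abstracted away, so this shows only the removal bookkeeping:

```python
import math

def part2_removals(q, points, k, nears):
    # Part 2 of the algorithm: alternately delete the nearest and the
    # farthest point (with respect to the query q) from the set, until
    # NEARS plus the removed points can supply k neighbors.
    p = list(points)
    removed = []
    mod = 0
    while len(nears) + len(removed) < k and p:
        if mod % 2 == 0:
            victim = min(p, key=lambda s: math.dist(s, q))   # nearest
        else:
            victim = max(p, key=lambda s: math.dist(s, q))   # farthest
        removed.append(victim)
        p.remove(victim)
        mod += 1
    return removed, p

pts = [(0, 0), (1, 1), (5, 5), (2, 2), (9, 9)]
removed, rest = part2_removals((0, 0), pts, 3, nears=[])
print(removed)  # nearest, then farthest, then nearest again
```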
If the number of neighbors found is still less than the required number, the algorithm starts its third part. In this part, the Voronoi vertices and their distances from the query are recorded in a list. As explained for the first part of the algorithm, the Voronoi vertices produced by Fortune's algorithm and their distances to the query are enough to check whether the condition of line 8 is met, so the vertices and their distances to the query are recorded. The following line is added after line 7 of the first part:

add pair(Voronoi_Vertex, distance_To_Query) to list V

Moreover, whenever an input point is added to the list of neighbors, its distance to the query must be added to the list as well.
Using these two lists (once filled, they can be sorted by distance to the query), the nearest point or Voronoi vertex is obtained. This nearest point can then be taken as the input query, and the whole process of the first and second parts of the algorithm is repeated until the required number of neighbors is achieved. Finally, to obtain more neighbors, the method can be repeated sequentially over the points closer to the query. This part of the algorithm has the same complexity as the two other parts, as the whole process performed for the preliminary query is repeated for its representatives.
Figure 4. Implementation of the HOVD
In Figure 4, "o" is a Voronoi vertex and the center of a circle event created by p1, p2, and p3. Based on the algorithm, the circle that encompasses the query causes p1, p2, and p3 to be added as neighbors of the query to the neighbors' list. As k approaches n, computing higher orders of the Voronoi diagram makes the circles bigger and bigger, so farther neighbors are added to the query's neighbor list.
4.2 The complexity of HOV-kNN
As mentioned before, the HOV-kNN algorithm has a lower time complexity than D.T. Lee's algorithm. To see this, consider the algorithm presented in the previous section. Line 13 states that the main body of the algorithm must be repeated k times, where k is the number of neighbors to be found. In each repetition, one of the query's neighbors is detected by the algorithm and subsequently eliminated from the dataset. The principal, and most time-consuming, part of our algorithm lies between lines 6 and 9; it invokes the modified Fortune algorithm, which has time complexity O(n log n). Therefore the overall complexity of our algorithm will be:
∑_{i=1}^{k} O(n log n) = O(n log n) · ∑_{i=1}^{k} 1 = k · O(n log n) = O(kn log n)   (2)
In comparison to the algorithm introduced in [12], which has time complexity O(k²n log n), our algorithm is k times faster. The main reason for this difference is that Lee's algorithm computes the HOVD completely, while ours exploits only a fraction of the HOVD construction process. In terms of space, the space complexity of our algorithm is the same as that of Fortune's algorithm: O(n).
5. IMPLEMENTATION AND EVALUATION
This section presents the results of the HOV-kNN algorithm and compares them with another algorithm. We use the Voronoi diagram to find the k nearest neighbor points with reduced complexity. The proposed algorithm was implemented in C++; data points were maintained in a vector, one of the C++ standard library containers. The input data points used in the program tests were generated randomly. To reach the preferred data distribution (points neither too close nor too far apart), they were generated under specific conditions: for instance, for 100 input points the point-generation range is 0-100, and for 500 input points the range is 0-500. To ensure the accuracy and validity of the output, a simple kNN algorithm was implemented and the outputs of the two algorithms were compared (equal input, equal query). The evaluation was carried out sequentially, the outputs were stored in two separate files, and the two files were then used as input to another program that computed their similarity rate.
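The validation step (comparing against a simple kNN on equal input and equal query) can be sketched like this; the similarity rate is simply the overlap between the two answer sets:

```python
import math
import random

def simple_knn(q, points, k):
    # Reference brute-force kNN used as ground truth.
    return sorted(points, key=lambda p: math.dist(p, q))[:k]

def similarity_rate(answer, reference):
    # Fraction of the reference answer reproduced by the tested algorithm.
    return len(set(answer) & set(reference)) / len(reference)

# Generation range matches the point count, as in the paper's setup.
random.seed(0)
n = 100
points = [(random.uniform(0, n), random.uniform(0, n)) for _ in range(n)]
q = (50.0, 50.0)
reference = simple_knn(q, points, 5)
# Comparing the reference against itself gives 100% similarity.
print(similarity_rate(reference, reference))  # 1.0
```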
The evaluation was conducted in two steps. First, the parameter k was taken as constant and the evaluation was performed using different numbers of data points as input. As pictured in Figure 5, the accuracy of the algorithm is more than 90%; in this diagram, the number of inputs in the dataset varies between 10 and 100,000. In the second step, the evaluation was conducted with different values of k while the number of input data points was fixed; the accuracy obtained was 74% for k between 10 and 500 (Figure 6).
Figure 5. The accuracy of the algorithm for constant k and different points of data as input
Figure 6. The accuracy of the algorithm for variable k and constant data as input
6. CONCLUSION AND FUTURE WORK
We have introduced a new algorithm, named HOV-kNN, with time complexity O(kn log n), which computes the order-k Voronoi diagram to find the k nearest neighbors in a set of N points in the Euclidean plane. The proposed algorithm finds the k nearest neighbors in two stages: 1) while constructing the first-order Voronoi diagram, some of the query's neighbors are obtained within the complexity of Fortune's algorithm; 2) the Voronoi vertices are then computed sequentially. Because it eliminates pre-processing steps, this algorithm is significantly suitable for dynamic spaces in which data points are moving. The experiments were done twofold: 1) a constant number of data points with variable k, and 2) a variable number of data points with constant k. The obtained results show that this algorithm has sufficient accuracy to be applied in real situations. In future work we will try to produce a parallel version of our algorithm for efficient implementation on a parallel machine, to obtain faster execution. Such an algorithm will be appropriate when the number of input points is massive and probably distributed over a network of computers.
REFERENCES
[1] Lifshits, Y. Nearest neighbor search: algorithmic perspective. SIGSPATIAL Special, Vol. 2, No. 2, 2010, 12-15.
[2] Shakhnarovich, G., Darrell, T., and Indyk, P. Nearest Neighbor Methods in Learning and Vision: Theory and Practice. The MIT Press, United States, 2005.
[3] Andoni, A. Nearest Neighbor Search: the Old, the New, and the Impossible. PhD thesis, Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2009.
[4] Bhatia, N., and Ashev, V. Survey of Nearest Neighbor Techniques. International Journal of Computer Science and Information Security, Vol. 8, No. 2, 2010, 1-4.
[5] Dhanabal, S., and Chandramathi, S. A Review of various k-Nearest Neighbor Query Processing Techniques. Computer Applications, Vol. 31, No. 7, 2011, 14-22.
[6] Hajebi, K., Abbasi-Yadkori, Y., Shahbazi, H., and Zhang, H. Fast approximate nearest-neighbor search with k-nearest neighbor graph. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI'11), Vol. 2, Toby Walsh (Ed.), 2011, 1312-1317.
[7] Fukunaga, K., and Narendra, P. M. A Branch and Bound Algorithm for Computing k-Nearest Neighbors. IEEE Transactions on Computers, Vol. 24, No. 7, 1975, 750-753.
[8] Song, Z., and Roussopoulos, N. K-Nearest Neighbor Search for Moving Query Point. In Proceedings of the 7th International Symposium on Advances in Spatial and Temporal Databases (Redondo Beach, California, USA), Springer-Verlag, 2001, 79-96.
[9] Güting, R., Behr, T., and Xu, J. Efficient k-Nearest Neighbor Search on moving object trajectories. The VLDB Journal, Vol. 19, No. 5, 2010, 687-714.
[10] Frentzos, E., Gratsias, K., Pelekis, N., and Theodoridis, Y. Algorithms for Nearest Neighbor Search on Moving Object Trajectories. Geoinformatica, Vol. 11, No. 2, 2007, 159-193.
[11] de Berg, M., Cheong, O., van Kreveld, M., and Overmars, M. Computational Geometry: Algorithms and Applications, Third Edition. Springer-Verlag, 2008.
[12] Lee, D. T. On k-Nearest Neighbor Voronoi Diagrams in the Plane. IEEE Transactions on Computers, Vol. C-31, No. 6, 1982, 478-487.
[13] Fortune, S. A sweep line algorithm for Voronoi diagrams. In Proceedings of the Second Annual Symposium on Computational Geometry, Yorktown Heights, New York, United States, 1986, 313-322.
[14] Meyerhenke, H. Constructing Higher-Order Voronoi Diagrams in Parallel. In Proceedings of the 21st European Workshop on Computational Geometry, Eindhoven, The Netherlands, 2005, 123-126.
[15] Gao, Y., Zheng, B., Chen, G., and Li, Q. Algorithms for constrained k-nearest neighbor queries over moving object trajectories. Geoinformatica, Vol. 14, No. 2, April 2010, 241-276.
This paper may be cited as:
Abbasifard, M. R., Naderi, H. and Mirjalili, M., 2014. HOV-kNN: A New
Algorithm to Nearest Neighbor Search in Dynamic Space. International
Journal of Computer Science and Business Informatics, Vol. 9, No. 1, pp.
12-22.
A Survey on Mobile Malware:
A War without End
Sonal Mohite
Sinhgad College of Engineering,
Vadgaon. Pune, India.
Prof. R. S. Sonar
Associate Professor
Sinhgad College of Engineering,
Vadgaon. Pune, India.
ABSTRACT
Nowadays, mobile devices have become an inseparable part of our everyday lives, and their usage has grown exponentially. As the functionality of mobile phones is upgraded, the malware threat to mobile phones is expected to increase. This paper sheds light on when and how mobile malware evolved. The current mobile operating system market shares and the number and types of mobile malware are also described. Mobile malware can propagate via three communication media, viz. SMS/MMS, Bluetooth/Wi-Fi, and FM-RDS. Several mobile malware detection techniques are explained with implemented examples, and when to use each detection technique is clarified along with its pros and cons. Typically, static analysis of an application is done first, followed by dynamic analysis; if ample external resources are available, cloud-based analysis is chosen. Application permission analysis and battery life monitoring are novel approaches to malware detection. Along with malware detection, preventing mobile malware has become critical: proactive and reactive techniques of mobile malware control are defined and explained, and a few tips are provided to restrain malware propagation. Ultimately, a structured and comprehensive overview of the research on mobile malware is presented.
Keywords
Mobile malware, malware propagation, malware control, malware detection.
1. INTRODUCTION
Decades ago, computers were the only traditional devices used for computing. Here and now, smartphones are used as supporting computing devices alongside computers. With the increasing capabilities of such phones, malware, once the biggest threat to computers, has become widespread on smartphones too. The damage done by mobile malware includes theft of confidential data from the device, eavesdropping on ongoing conversations by third parties, extra charges incurred through sending SMS to premium-rate numbers, and even location-based tracking of the user, which is too severe to overlook. So there is a judicious requirement to understand the propagation means of mobile malware, the various techniques to detect it, and how to restrain it.
2. RELATED WORKS
Malware is a malicious piece of software designed to damage a computer system and interrupt its typical working; fundamentally, "malware" is short for "malicious software". Mobile malware is malicious software aimed at mobile phones instead of traditional computer systems. With the evolution of mobile phones, mobile malware started evolving too [1-4]. When the propagation medium is taken into account, mobile viruses are of three types: Bluetooth-based, SMS-based, and FM-RDS-based [5-9]. A BT-based virus propagates through Bluetooth and Wi-Fi and has a regional impact [5], [7], [8]. On the contrary, an SMS-based virus follows a long-range spreading pattern and can propagate through SMS and MMS [5], [6], [8]. An FM-RDS-based virus uses the RDS channel of an FM radio transmitter for propagation [9]. Our work addresses the effect of the operational behavior of users and the mobility of devices on virus propagation. There are several methods of malware detection, viz. the static method, the dynamic method, cloud-based detection, battery-life monitoring, application permission analysis, enforcing a hardware sandbox, etc. [10-18]. In addition to the work given in [10-18], our work addresses the pros and cons of each malware detection method. Along with studying virus propagation and detection mechanisms, methods of restraining virus propagation are also vital; a number of proactive and reactive malware control strategies are given in [5], [10].
3. EVOLUTION OF MOBILE MALWARE
Although the first mobile malware, 'Liberty Crack', was developed in the year 2000, mobile malware evolved rapidly during the years 2004 to 2006 [1]. Enormous varieties of malicious programs targeting mobile devices evolved during this period and are still evolving today. These programs resemble the malware that targeted traditional computer systems: viruses, worms, and Trojans, the latter including spyware, backdoors, and adware. At the end of 2012, there were 46,445 mobile malware modifications. By the end of June 2013, however, Kaspersky Lab had added an aggregate total of 100,386 mobile malware modifications to its system [2], and the total number of mobile malware samples at the end of December 2013 was 148,778 [4]. Moreover, Kaspersky Lab [4] has collected 8,260,509 unique malware installation packs. This shows a dramatic increase in mobile malware. The arrival of 'Cabir', the second mobile malware (a worm developed in 2004 for Symbian OS), confirmed the basic rule of computer virus evolution: three conditions need to be fulfilled for malicious programs to target any particular operating system or platform:
The platform must be popular: During the evolution of 'Cabir', Symbian was the most popular smartphone platform. Nowadays, however, it is Android that is most targeted by attackers. Malware authors continue to concentrate on the Android platform, as it holds 93.94% of the total market share in mobile phones and tablet devices.
There must be well-documented development tools for the platform: Nowadays every mobile operating system developer provides a software development kit and precise documentation, which makes application development easy.
There must be vulnerabilities or coding errors: During the evolution of 'Cabir', Symbian had a number of loopholes that allowed malware intrusion. In this day and age, the same applies to Android [3].
The market share of an operating system plays a crucial role in mobile malware development: the higher the market share, the higher the possibility of malware infection. The pie chart below illustrates the distribution of mobile malware by operating system (platform) [4]:
Figure 1. OS-wise malware distribution
4. MOBILE MALWARE PROPAGATION
There are three communication channels through which malware can propagate: SMS/MMS, Bluetooth/Wi-Fi, and FM radio broadcasts.
4.1 SMS / MMS
Viruses that use SMS as a communication medium can send copies of themselves to all phones recorded in the victim's address book. The virus can spread by means of forwarded photos, videos, short text messages, and so on. Propagation follows a long-range spreading pattern analogous to the spreading of computer viruses, such as worm propagation in e-mail networks [6]. For an accurate study of SMS-based virus propagation, one needs to consider certain operational patterns, such as whether or not users open a virus attachment; hence the operational behavior of users plays a vital role in SMS-based virus propagation [8].
4.1.1 Process of malware propagation
If a phone is infected with an SMS-based virus, the virus regularly sends copies of itself to the phones whose contact numbers are found in the contact list of the infected phone. After receiving such a distrustful message, a user may open or delete it according to his alertness; if he opens the message, he is infected. But if a phone is immunized with antivirus software, a newly arrived virus will not propagate even if the user opens the infected message. Therefore, the security awareness of mobile users plays a key role in SMS-based virus propagation. The same process applies to MMS-based virus propagation, although MMS carries a more sophisticated payload than SMS: it can carry video and audio in addition to SMS's simple text and pictures.
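The role of user behavior can be illustrated with a toy simulation over a contact graph (all names and parameters here are invented): an infected phone messages its contacts, and a recipient becomes infected only if it opens the message and is not immunized.

```python
import random

def simulate_sms_virus(contacts, seed_phone, p_open, immunized, rounds, rng):
    # contacts: phone -> list of phones in its address book.
    # A message from an infected phone infects a recipient only when the
    # recipient opens it (probability p_open) and has no antivirus.
    infected = {seed_phone}
    for _ in range(rounds):
        for phone in list(infected):
            for target in contacts.get(phone, []):
                if (target not in infected and target not in immunized
                        and rng.random() < p_open):
                    infected.add(target)
    return infected

contacts = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"], "D": []}
rng = random.Random(42)
result = simulate_sms_virus(contacts, "A", p_open=1.0,
                            immunized={"C"}, rounds=3, rng=rng)
print(sorted(result))  # C never gets infected (immunized)
```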
4.2 Bluetooth/ Wi-Fi
Viruses that use Bluetooth as a communication channel are local-contact-driven viruses, since they infect other phones within their short radio range. A BT-based virus infects individuals in the vicinity of the sender, each of whom has an equal probability of contact with the others [7]. Mobility characteristics of users, such as whether or not a user moves in a given hour, the probability of returning to visited places, and the distance a user will travel next, need to be considered [8].
4.2.1 Process of malware propagation
Unlike SMS-based viruses, a phone infected by a BT-based virus spontaneously and automatically searches for other phones through the available Bluetooth services, and the virus replicates within the radio range of the sender's device. For that reason, users' mobility patterns and contact
frequency among mobile phones play crucial roles in BT-based virus propagation. The same process is followed for Wi-Fi, which can carry a higher payload over a larger range than Bluetooth.
4.3 FM-RDS
Several existing electronic devices do not support data connectivity but include an FM radio receiver; examples are low-end mobile phones, media players, and vehicular audio systems. FM provides the FM Radio Data System (RDS), a low-rate digital broadcast channel. It is intended for delivering simple information about the station and the current program, but it can also be used by a broad range of new applications and to enhance existing ones [9].
4.3.1 Process of malware propagation
The attacker can attack in two different ways. The first is to create a seemingly benign app and upload it to popular app stores. Once the user downloads and installs the app, it contacts an update server and updates its functionality; this newly added malicious functionality decodes and assembles the payload. Finally, the assembled payload is executed by the Trojan app to escalate privileges on the attacked device and use it for malicious purposes. The other way is for the attacker to obtain a privilege-escalation exploit for the desired target. As the RDS protocol has limited bandwidth, the exploit must be packetized: packetization breaks a multi-kilobyte binary payload into several smaller Base64-encoded packets, with sequence numbers attached so the data can be properly reassembled at the receiver side. The received exploit is then executed, and in this way the device is infected with malware [9].
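The packetization step described above can be sketched as follows; the packet format with sequence numbers is a guessed illustration, not the actual RDS framing:

```python
import base64

def packetize(payload: bytes, chunk_size: int):
    # Split a multi-kilobyte binary payload into small Base64-encoded
    # packets, each prefixed with a sequence number for reassembly.
    chunks = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)]
    return [(seq, base64.b64encode(chunk).decode("ascii"))
            for seq, chunk in enumerate(chunks)]

def reassemble(packets):
    # Receiver side: order by sequence number, decode, concatenate.
    ordered = sorted(packets, key=lambda p: p[0])
    return b"".join(base64.b64decode(text) for _, text in ordered)

payload = bytes(range(256)) * 8           # a ~2 KB binary blob
packets = packetize(payload, chunk_size=48)
assert reassemble(packets) == payload     # survives out-of-order delivery
print(len(packets))
```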
5. MOBILE MALWARE DETECTION TECHNIQUE
Once malware has propagated, malware detection needs to be carried out. In this section, various mobile malware detection techniques are explained.
5.1 Static Analysis Technique
As the name indicates, static analysis evaluates an application without executing it [10-11]. It is an economical and fast approach for detecting malevolent characteristics in an application. Static analysis can be used for the static pre-checks performed before an application is admitted to an online application market; such markets are available for most major smartphone platforms, e.g. 'Play Store' for Android and 'Store' for the Windows operating system. These extended pre-checks enhance the probability of malware detection, and further spreading of malware in online application stores can therefore be prevented. In static analysis, the application is investigated for apparent security threats like memory corruption flaws, bad code segments, etc. [10], [12].
5.1.1 Process of malware detection
If the source code of the application is available, static analysis tools can be used directly for further examination of the code. If the source code is not available, the executable app is converted back to source code; this process is known as disassembling. Once the application is disassembled, feature extraction is done: certain parameters, viz. system calls, data flow, control flow, etc., are observed. Depending on the observations, anomalies are detected, and the application is categorized as either benign or malicious.
Pros: An economical and fast approach to malware detection.
Cons: Source code is not readily available for most applications, and disassembling might not recover the exact source code.
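A toy version of such a static pre-check might simply scan decompiled source for suspicious API patterns (the pattern list is invented for illustration; real analyzers such as Fortify SCA perform flow analysis rather than keyword matching):

```python
import re

# Hypothetical patterns a naive static pre-check might flag.
SUSPICIOUS = [
    r"sendTextMessage",                 # SMS, e.g. to premium numbers
    r"getDeviceId",                     # device-identifier leakage
    r"Runtime\.getRuntime\(\)\.exec",   # shell command execution
]

def static_precheck(source_code: str):
    # Return the suspicious patterns found in the (decompiled) source;
    # an empty list classifies the app as tentatively benign.
    return [p for p in SUSPICIOUS if re.search(p, source_code)]

benign_src = "TextView tv = findViewById(R.id.label);"
shady_src = 'smsManager.sendTextMessage("9999", null, body, null, null);'
print(static_precheck(benign_src), static_precheck(shady_src))
```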
Figure 2. Static Analysis Technique
5.1.2 Example
Figure 2 shows the malware detection technique proposed by Enck et al. [12] for Android. The application's installation image (.apk) is used as input to the system. Ded, a Dalvik decompiler, is used to decompile the code: it generates Java source code from the .apk image. Feature extraction is done using Fortify SCA, a static code analysis suite that provides four types of analysis: control flow analysis, data flow analysis, structural analysis, and semantic analysis. It is used to evaluate the recovered source code and categorize the application as either benign or malicious.
5.2 Dynamic Analysis Technique
Dynamic analysis consists of analyzing the actions performed by an application while it is being executed. The mobile application is executed in an isolated environment, such as a virtual machine or emulator, and its dynamic behavior is monitored [10], [11], [13]. There are various methodologies for performing dynamic analysis, viz. function call monitoring, function parameter analysis, information flow tracking, instruction tracing, etc. [13].
5.2.1 Process of malware detection
The dynamic analysis process is quite different from static analysis. The application is installed in a standard emulator; after installation, the app is executed for a specific time and fed random user inputs. Using the various methodologies mentioned in [13], the application is examined, and on the basis of its runtime behavior it is classified as either benign or malicious.
Pros: A comprehensive approach to malware detection; most malware is detected by this technique.
Cons: Comparatively complex and requires more resources.
Figure 3. Dynamic Analysis Technique
5.2.2 Example
Figure 3 shows Android Application Sandbox (AASandbox) [14], the dynamic malware detection technique proposed by Blasing et al. for Android. It is a two-step analysis process comprising both static and dynamic analysis. AASandbox first performs a static pre-check, followed by a comprehensive dynamic analysis. In the static analysis, the application image binary is disassembled, and the disassembled code is used for feature extraction and for searching for any suspicious patterns. After static analysis, dynamic analysis is performed: the binary is installed and executed in the AASandbox, and 'Android Monkey' is used to generate runtime inputs. System calls are logged, and the generated log files are then summarized and condensed into a mathematical vector for easier analysis. In this way, the application is classified as either benign or malicious.
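The final condensation step can be pictured as follows. The syscall vocabulary and log format here are assumptions for illustration; [14] does not prescribe this exact encoding.

```python
from collections import Counter

# Hypothetical fixed syscall vocabulary; AASandbox condenses its syscall
# log into a mathematical vector, but the exact scheme is unspecified.
SYSCALLS = ["open", "read", "write", "sendto", "fork"]

def log_to_vector(log_lines):
    """Condense a syscall log into a count vector over a fixed vocabulary."""
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return [counts.get(name, 0) for name in SYSCALLS]
```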
5.3 Cloud-based Analysis Technique
Mobile devices possess limited battery and computational capacity. With such constrained resources, it is quite difficult to deploy a full-fledged security mechanism on a smartphone. As data volume increases, it is more efficient to move security mechanisms to an external server rather than increase the workload of the mobile device [10], [15].
5.3.1 Process of malware detection
In the cloud-based method of malware detection, all security computations are moved to the cloud, which hosts several replicas of the mobile phones running on emulators, and the result is sent back to the mobile device. This improves the performance of mobile devices.
Pros: The cloud holds ample resources of every type, enabling more comprehensive malware detection.
Cons: Extra cost to maintain the cloud and to forward data to the cloud server.
5.3.2 Example
Figure 4 shows Paranoid Android (PA), proposed by Portokalidis et al. [15]. Here, security analysis and computations are moved to a cloud (remote server). It consists of two modules, a tracer and a replayer. A tracer is located in each smartphone. It records all the information required to reiterate the execution of the mobile application on the remote server. The information recorded by the tracer is first filtered and encoded, then stored, and the synchronized data is sent to the replayer over an encrypted channel. The replayer is located in the cloud. It holds the replica of the mobile phone running on an emulator and records the information communicated by the tracer. The replayer replays the same execution on the emulator, in the
cloud. The cloud, as a remote server, owns abundant resources for performing multifarious analyses on the data collected from the tracer. During the replay, numerous security analyses such as dynamic malware analysis, memory scanning, system call tracing, and call graph analysis [15] are performed; indeed, there is no limit on the number of attack detection techniques that can be applied in parallel.
Figure 4. Cloud-based Detection Technique
5.4 Monitoring Battery Consumption
Monitoring battery life is a completely different approach to malware detection compared with the other techniques. Smartphones usually possess limited battery capacity, which must be used judiciously. The usual user behavior, existing battery state, signal strength and network traffic details of a mobile device are recorded over time, and this data can be used effectively to detect hidden malicious activities. By observing current energy consumption, such malicious applications can indeed be detected, as they are expected to draw more power than normal usage. However, battery power consumption is also one of the major limitations of mobile phones, limiting the complexity of anti-malware solutions. Quite remarkable work has been done in this field; the introductory exploration in this domain was done by Jacoby and Davis [16].
5.4.1 Process of malware detection
After infecting a device, the malware keeps replicating itself. If the means of propagation is Bluetooth, then the device continuously scans for
adjacent Bluetooth-enabled devices, which in turn consumes a remarkable amount of power. This time-domain data of power consumption, collected over a period of time, is transformed into frequency-domain data and represented as dominant frequencies. Malware is identified from these dominant frequencies.
Pros: An economical and novel approach to malware detection.
Cons: Because of the multi-functionality of smartphones, the power consumption model of a smartphone cannot be accurately defined.
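The time-to-frequency transformation described above can be sketched with a discrete Fourier transform. The sampling rate and the dominant_frequencies helper below are hypothetical illustrative choices, not the exact method of [16].

```python
import numpy as np

def dominant_frequencies(power_samples, sample_rate_hz, top_k=3):
    """Transform time-domain power readings into the frequency domain and
    return the strongest non-DC frequency components, in Hz."""
    spectrum = np.abs(np.fft.rfft(power_samples))
    freqs = np.fft.rfftfreq(len(power_samples), d=1.0 / sample_rate_hz)
    order = np.argsort(spectrum[1:])[::-1] + 1  # skip the DC component
    return freqs[order[:top_k]]
```

A periodic Bluetooth scan would show up as a spike at its repetition frequency in the returned list.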
5.4.2 Example
Recent work by Liu et al. [17] proposed another detection technique that compares the compressed sequences of power consumption values in each time interval. They defined a user-centric power model that relies on user actions; user actions such as the duration and frequency of calls, the number of SMS messages, and network usage are taken into account. Their work uses machine learning techniques to generate rules for malware detection.
5.5 Application Permission Analysis
With advancements in mobile phone technology, users have started downloading third-party applications, which are available in third-party application stores. While developing an application, developers need to request the required permissions from the device in order to make the application work on it. Permissions play a crucial role in mobile application development, as they convey the intents and back-end activities of the application to the user. Permissions should be precisely defined and displayed to the user before the application is installed. However, some application developers hide certain permissions from the user, making the application vulnerable and malicious.
5.5.1 Process of malware detection
The security configuration of an application is extracted, and the permissions taken by the application are analyzed. If the application has requested any unwanted permissions, it is categorized as malicious.
Pros: Fewer resources are required compared to other techniques.
Cons: Analyzing only the permissions request is not adequate for mobile
malware detection; it needs to be done in parallel with static and/or dynamic
analysis.
5.5.2 Example
Kirin, proposed by Enck et al. (2009) [18], is an application certification system for Android. During installation, Kirin checks the application's permissions. It extracts the security configuration of the application
and checks it against templates, i.e., security policy rules already defined by Kirin. If an application fails to satisfy all the security policy rules, Kirin either deletes the application or alerts the user for assistance [18].
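A minimal sketch of Kirin-style certification might look as follows. The two policy rules shown are invented examples, not Kirin's actual rule set from [18].

```python
# Hypothetical security-policy rules in the spirit of Kirin: each rule names
# a permission combination that should not be requested together.
POLICY_RULES = [
    {"RECEIVE_SMS", "SEND_SMS"},      # SMS interception/relay
    {"RECORD_AUDIO", "INTERNET"},     # eavesdropping plus exfiltration
]

def passes_policy(requested_permissions):
    """An app fails certification if it requests every permission in any rule."""
    requested = set(requested_permissions)
    return not any(rule <= requested for rule in POLICY_RULES)
```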
6. MOBILE MALWARE CONTROL STRATEGIES
Basically, there are two types of malware control strategies, viz. proactive and reactive control. In the proactive malware control strategy, malware is mitigated before it propagates, using a proper set of preventive measures. In the reactive malware control strategy, the malware first propagates and action is then taken upon contamination.
6.1 Proactive Malware Control Strategy
Here are some of the proactive malware control techniques given in [10]; however, the user's own security awareness plays a crucial role.
- Install a decent mobile security application, i.e., an antivirus.
- Always download apps from trusted official application markets.
- Before downloading any app, read its reviews and ratings. During installation, always read the permissions requested by the app, and if anything appears doubtful, do not install it.
- Always keep installed apps up to date.
- Turn off Wi-Fi, Bluetooth, and other short-range wireless communication media when not in use. Be especially cautious when connecting to insecure public Wi-Fi networks and when accepting Bluetooth data from unknown senders.
- When confidential data is to be stored on the mobile phone, encrypt it before storing and set a password for access. Take regular back-ups. Ensure that sensitive information is not cached locally on the phone.
- Always keep an eye on battery life and SMS and call charges; if any unusual behavior is found, perform an in-depth check of recently installed applications.
- During internet access, do not click on links that seem suspicious or untrustworthy.
- Finally, in case of mobile phone theft, delete all contacts, applications, and confidential data remotely.
6.2 Reactive Malware Control Strategy
The working principle of the reactive malware control strategy is that the control action is implemented only after the malware is detected. An antivirus solution itself comes under proactive malware control; however, when a new
malware is found, antivirus updates for that malware are created and forwarded to mobile phones, which is part of reactive malware control. This is known as adaptive patch dissemination.
Adaptive Patch Dissemination
Pre-immunization, such as antivirus software, is used to protect networks before virus propagation. In reality, however, we first detect certain viruses and then issue antivirus updates, known as patches. These patches are forwarded into networks only after the viruses have already propagated. Network bandwidth limits the speed with which security notifications or patches can be sent to all users simultaneously. Therefore, a new strategy, namely the adaptive dissemination strategy, has been developed. It is based on the Autonomy Oriented Computing (AOC) methodology, which helps send security notifications or patches to most phones at a relatively low communication cost. AOC is used to search for a set of highly connected phones with large communication abilities in a mobile network [5].
7. CONCLUSION
Rapid growth in smartphone development has resulted in the evolution of mobile malware. Operating system market share plays a crucial role in malware evolution. SMS/MMS is the fastest means of mobile malware propagation, as it has no geographical boundary, unlike BT/Wi-Fi; FM-RDS is still evolving. Among all malware detection techniques, static malware detection is performed first, during pre-checks; dynamic analysis is performed later and can be combined with application permission analysis. Cloud-based analysis is a more comprehensive approach, as it uses external resources to perform malware detection and can perform more than one type of analysis simultaneously. The proactive control strategy is used to control malware before it propagates, while the reactive control strategy is used after the malware has propagated.
REFERENCES
[1] La Polla, M., Martinelli, F., & Sgandurra, D. (2012). A survey on security for mobile
devices. IEEE Communications Surveys & Tutorials, 15(1), 446 – 471.
[2] Kaspersky Lab IT Threat Evolution: Q2 2013. (2013). Retrieved from
http://www.kaspersky.co.in/about/news/virus/2013/kaspersky_lab_it_threat_evolution_q2_
2013.
[3] Kaspersky Security Bulletin 2013: Overall statistics for 2013. (2013 December).
Retrieved from
http://www.securelist.com/en/analysis/204792318/Kaspersky_Security_Bulletin_2013_Ove
rall_statistics_for_2013.
[4] Maslennikov, D. Mobile Malware Evolution: Part 6. (2013 February). Retrieved from
http://www.securelist.com/en/analysis/204792283/Mobile_Malware_Evolution_Part_6.
[5] Gao, C., and Liu, J. (2013). Modeling and restraining mobile virus propagation. IEEE
transactions on mobile computing, 12(3), 529-541.
[6] Gao, C. and Liu, J. (2011). Network immunization and virus propagation in Email
networks: Experimental evaluation and analysis. Knowledge and information systems,
27(2), 253-279.
[7] Yan, G., and Eidenbenz, S. (2009, March). Modeling propagation dynamics of
Bluetooth worms (extended version). IEEE transactions on Mobile Computing, 8(3), 353-
368.
[8] Gonzalez, M., Hidalgo, C., and Barabasi, A. (2008). Understanding individual human
mobility patterns. Nature, 453(7196), 779-782.
[9] Fernandes, E., Crispo, B., Conti, M. (2013, June). FM 99.9, Radio virus: Exploiting
FM radio broadcasts for malware deployment. Transactions on information forensics and
security, 8(6), 1027-1037.
[10] Chandramohan, M., and Tan, H. (2012). Detection of mobile malware in the wild.
IEEE computer society, 45(9), 65-71.
[11] Yan, Q., Li, Y., Li, T., and Deng, R. (2009). Insights into malware detection and
prevention on mobile phones. Springer-Verlag Berlin Heidelberg, SecTech 2009, 242–249.
[12] Enck, W., Octeau, D., Mcdaniel, P., and Chaudhuri, S. (2011 August). A study of
android application security. The 20th Usenix security symposium.
[13] Egele, M., Scholte, T., Kirda, E., Kruegel, C. (2012 February). A survey on automated
dynamic malware-analysis techniques and tools. ACM-TRANSACTION, 4402(06), 6-48.
[14] Blasing, T., Batyuk, L., Schmidt, A., Camtepe, S., and Albayrak, S. (2010). An
android application sandbox system for suspicious software detection. 5th International
Conference on Malicious and Unwanted Software.
[15] Portokalidis, G., Homburg, P., Anagnostakis, K., Bos, H. (2010 December). Paranoid
android: Versatile protection for smartphones. ACSAC'10.
[16] Jacoby, G. (2004). Battery-based intrusion detection. The Global Telecommunications
Conference.
[17] Liu, L., Yan, G., Zhang, X., and Chen, S. (2009). Virusmeter: Preventing your
cellphone from spies. RAID, 5758, 244-264.
[18] Enck, W., Ongtang, M., and Mcdaniel, P. (2009 November). On lightweight mobile
phone application certification. 16th ACM Conference on Computer and Communications
Security.
This paper may be cited as:
Mohite, S. and Sonar, R. S., 2014. A Survey on Mobile Malware: A War
without End. International Journal of Computer Science and Business
Informatics, Vol. 9, No. 1, pp. 23-35.
An Efficient Design Tool to Detect
Inconsistencies in UML Design Models
Mythili Thirugnanam
Assistant Professor (Senior)
School of Computing Science and Engineering
VIT University, Vellore, Tamil Nadu
Sumathy Subramaniam
Assistant Professor (SG)
School of Information Technology and Engineering
VIT University, Vellore, Tamil Nadu
ABSTRACT
The quality of any software developed is evaluated based on its design. Design is one of the most important phases in the software life cycle, and poor design leads to a high failure rate of the software. To design software, various traditional and UML models are widely used. Many tools have been proposed and are available to design UML models as per user requirements. However, these tools do not support validation of UML models, which ultimately leads to design errors. Most existing testing tools check for consistency of UML models; some tools check for inconsistency of UML models that do not follow the consistency rules required for UML models. The proposed work aims to develop an efficient tool that detects inconsistency in given UML models. Parsing techniques are applied to extract the XML tags. The extracted tags contain relevant details such as the class name, attribute names, operation names and the associations with their corresponding names in the class diagram, in the meta-model format. On applying the consistency rules to the given input UML model, inconsistency is detected and a report is generated. From the inconsistency report, the error efficiency and design efficiency are computed.
Keywords
Software Design, Unified Modeling Language (UML), Testing, Extensible Markup
Language (XML).
1. INTRODUCTION
In present day scenario, software programming is moving towards high-
level design, which raises new research issues and a scope for developing
new set of tools that supports design specification. Most research in
software specification use verification and validation techniques to prove
correctness in terms of certain properties. The delivery of high-quality
software product is a major goal in software engineering. An important
aspect is to achieve error free software product that assures quality of the
software. Inspection and testing are common verification and validation (V
& V) approaches for defect detection in the software development process.
Existing statistical data shows that the cost of finding and repairing
software bugs raises drastically in later development stages. The Unified
Modeling Language (UML) is now widely accepted as the standard modeling language for software construction. The class diagram, in its core view, provides the backbone for any modeling effort and has well-formed semantics.
2. BACKGROUND STUDY
Alexander Egyed [4, 5] presents an automated approach for detecting and tracking inconsistencies in real time and automatically identifying changes in various models that affect the consistency rules. The approach observes the behavior of consistency rules to understand how they affect the model. Techniques for efficiently detecting inconsistencies in UML models and identifying the changes required to fix problems are analyzed. The work describes a technique for automatically generating a set of concrete changes for fixing inconsistencies and providing information about the impact of each change on all consistency rules. The approach is integrated with the design tool IBM Rational Rose. Muhammad Usman [9] presents a survey of UML consistency checking techniques by analyzing various parameters and constructing an analysis table. The analysis table helps evaluate existing consistency checking techniques and concludes that most approaches validate intra- and inter-level consistencies between UML models using a monitoring strategy. UML class, sequence, and state chart diagrams are used in most existing consistency checking techniques. Alexander Egyed [3] demonstrates that a tool can assist the designer in discovering unintentional side effects, locating choices for fixing inconsistencies, and then changing the design model.
The paper examines the impact of changes on UML design models [10] and explores a methodology to discover the negative side effects of design changes and to predict the positive and negative impact of these choices. Alexander Egyed [1, 2] presents an approach for quickly, correctly, and automatically deciding which consistency rules need to be evaluated when a model changes. The approach does not require consistency rules with special annotations; instead, it treats consistency rules as black-box entities and observes their behavior during evaluation to identify the different types of model elements they access.
Christian Nentwich [6, 7] presents a repair framework for inconsistent distributed documents, generating interactive repairs from the full first-order logic formulae that constrain the documents. A full implementation of the components, as well as their application to UML and related heterogeneous documents such as EJB deployment descriptors, is presented. This approach can be used as an infrastructure for building highly domain-specific frameworks. Researchers have focused on removing
inconsistencies in a few UML models. The work proposed in [11] attempts to address and detect inconsistencies in UML models such as the class diagram, use case diagram, sequence diagram and so on. A survey exploring the impact of model-driven software development is given in [12]; change impact analysis, consistency management, uncertainty management, and inconsistency detection and resolution rules are dealt with in that work.
3. FRAMEWORK OF THE PROPOSED WORK
Figure 1. Framework of the proposed work
4. DETAILED DESCRIPTION OF THE PROPOSED WORK
The framework of the proposed work is given in Figure 1.
4.1. Converting the UML model into an XML file
A UML design diagram does not directly support inconsistency detection; doing so on the diagram itself is practically impossible. The UML model is therefore converted into an XML file for detecting inconsistency in the model. UML models such as the use case diagram, class diagram and sequence diagram can be taken as input by this tool. The final output of this module is an XML file, which is used further to detect inconsistency. A snapshot of reading the input file is shown in Figure 2.
(Figure 1 depicts the pipeline: select UML model → convert the UML model into an XML file → extract the XML tags → apply the parsing technique → apply the consistency rules → detect inconsistency in the given input → generate the inconsistency report.)
Procedure used:
- Convert the chosen input design into an XML file (Select Input File → Export as XML file → VP-UML project).
- Select the diagram that needs to be exported.
- Select the location where the exported file is to be stored.
The input file is read from the user to carry out the further process (Figure 2). Here, a use case diagram is read as the input file. The input diagram is stored as an XML file and passed as input to the next process, which extracts the XML tags.
4.2. Extracting the XML tags and applying the parsing technique
From the XML file, the XML tags are extracted. The parsing technique is applied to the XML tags to identify the related information of the given model, which is in the meta-model format [3]. For example, in a class diagram, the class name and its attributes and methods are identified. All the related information of the given input model is extracted.
Procedure used:
- Open the XML file.
- Copy the file as a text file.
- Split the tags into tokens and extract the relevant information about the diagram.
- Save the extracted result in a file.
Figures 3 and 4 describe the above-mentioned procedure. The XML file is considered the input for this step. This method adopts the tokenizer concept to split the tags and store them.
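The extraction step can be sketched with a standard XML parser. The tag and attribute names in the sample fragment below are assumptions for illustration; the exact schema of the exported file depends on the modeling tool.

```python
import xml.etree.ElementTree as ET

# Minimal illustrative fragment; a real export from a modeling tool uses a
# richer schema, so these tag and attribute names are assumptions.
SAMPLE = """<Model>
  <Class name="Account">
    <Attribute name="balance" visibility="private"/>
    <Operation name="deposit" visibility="public"/>
  </Class>
</Model>"""

def extract_class_info(xml_text):
    """Parse the XML and collect class, attribute and operation names."""
    root = ET.fromstring(xml_text)
    info = {}
    for cls in root.iter("Class"):
        info[cls.get("name")] = {
            "attributes": [a.get("name") for a in cls.iter("Attribute")],
            "operations": [o.get("name") for o in cls.iter("Operation")],
        }
    return info
```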
4.3. Detecting the design inconsistency
The consistency rules [8, 10] are applied to the related information of the given input design diagram to detect inconsistency. Related information which does not satisfy a rule indicates a design inconsistency in the given input model. All possible inconsistencies are detected as described below. Figure 5 shows the inconsistencies in a given use case diagram.
4.3.1. Consistency rules for the Class Diagram
- Visibility of a member should be given.
- Visibility of all attributes should be private.
- Visibility of all methods should be public.
- Associations should have cardinality relationships.
- When one class depends on another class, there should be a class interface notation.
4.3.2. Consistency rules for the Use Case Diagram
- Every actor has at least one relationship with a use case.
- The system boundary should be defined.
- All words that suggest incompleteness, such as "some" and "etc.", should be removed.
4.3.3. Consistency rules for the Sequence Diagram
- All objects should have at least one interaction with another object.
- For each message, proper parameters should be included.
Procedure used:
- Select the input design model.
- Based on the chosen design model (class diagram, use case diagram or sequence diagram), the extracted result is compared with the given consistency rules and inconsistency is detected.
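A minimal sketch of applying the visibility rules from Section 4.3.1 to one extracted member record follows; the record layout is a hypothetical simplification of the tool's meta-model format.

```python
def check_class_member(member):
    """Apply the class-diagram visibility rules from Section 4.3.1 to one
    extracted member record; returns the list of detected inconsistencies."""
    problems = []
    if member.get("visibility") is None:
        problems.append("visibility of a member should be given")
    elif member["kind"] == "attribute" and member["visibility"] != "private":
        problems.append("visibility of all attributes should be private")
    elif member["kind"] == "method" and member["visibility"] != "public":
        problems.append("visibility of all methods should be public")
    return problems
```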
4.4. Generating the inconsistency report
A collective report is generated for all the inconsistencies detected in the given input model. The report provides the overall inconsistency of the given input model, which is taken care of during implementation.
4.5. Computing Design Efficiency
The total number of possible errors in the design model is estimated [10]. Then the total number of errors found in the input design model is determined using the procedures discussed. The error efficiency is computed using equation 1, and from the calculated error efficiency of the design, the design efficiency is computed using equation 2. The implementation is shown in Figure 6.
[eq 1]
[eq 2]
5. RESULTS & DISCUSSION
In the recent past there has been a blossoming of new approaches in software design and testing. The proposed system primarily aims to detect inconsistency, thereby providing an efficient design specification. Though there is a lot of research going on in detecting inconsistencies in various UML models, not much work has been carried out on the use case diagram and class diagram. The developed system doesn't have
any constraint on the maximum number of lines of code. This added feature makes the tool more versatile when compared with existing tools. Various design models for different scenarios were taken as samples and tested for consistency. The results obtained show that the developed tool was able to detect all the inconsistencies present in the given input models.
Figure 2. Selecting input model (UML model is the chosen Use Case Design)
Figure 3. Snapshot shows the XML Format file that extracted from the input UML Model
Figure 4. Snapshot shows the relevant information obtained from the given design from the XML file
Figure 5. Snapshot shows inconsistency details for the given input design
Figure 6. Snapshot shows the efficiency of the given input design model
6. CONCLUSION AND FUTURE ENHANCEMENT
Inspection and testing of software are important approaches in software engineering practice that aim to reduce the number of defects in software products. Software inspection focuses on design specifications in the early phases of software development, whereas traditional testing approaches focus on the implementation phase or later. Software inspection is widely regarded as an effective defect-finding technique, and recent research has considered the application of tool support as a means to increase its efficiency. During design model construction and validation, a variety of faults can be found. Testing in the early phases of the software life cycle not only increases quality but also reduces the cost incurred. The developed tool can help enforce the inspection process, provide support for finding defects in the design model, and also compute the design efficiency by deriving the error efficiency. This work takes care of the major constraints imposed while creating design models such as the class diagram, use case diagram and sequence diagram. A further enhancement of the proposed work is to address the other major constraints in class diagrams, such as inheritance, association, cardinality constraints and so on.
REFERENCES
[1] A. Egyed and D. S. Wile, Support for Managing Design-Time Decisions, IEEE Transactions on Software Engineering, 2006.
[2] A.Egyed, Fixing Inconsistencies in UML Design Models, ICSE, 2007.
[3] A.Egyed, Instant Consistency Checking for UML, Proceedings of the International
Conference on Software Engineering, 2006.
[4] A. Egyed, E. Letier, A. Finkelstein, Generating and Evaluating Choices for Fixing Inconsistencies in UML Design Models, International Conference on Software Engineering, 2008.
[5] A. Egyed, Automatically Detecting and Tracking Inconsistencies in Software Design Models, IEEE Transactions on Software Engineering, ISSN: 0098-5589, 2009.
[6] C.Nentwich, I.Capra and A.Finkelstein, xlinkit: a consistency checking and smart link
generation service, ACM transactions on Internet Technology, 2002.
[7] C.Nentwich, W. Emmerich and A.Finkelstein, Consistency Management with Repair
Actions, ICSE, 2003.
[8] Diana Kalibatiene, Olegas Vasilecas, Ruta Dubauskaite, Ensuring Consistency in Different IS Models – UML Case Study, Baltic J. Modern Computing, Vol. 1, No. 1-2, pp. 63-76, 2013.
[9] Muhammad Usman, Aamer Nadeem, Tai-hoon Kim, Eun-suk Cho, A Survey of Consistency Checking Techniques for UML Models, Advanced Software Engineering & Its Applications, 2008.
[10] R. Dubauskaite, O. Vasilecas, Method on specifying consistency rules among different aspect models, expressed in UML, Elektronika ir Elektrotechnika, ISSN 1392-1215, Vol. 19, No. 3, 2013.
[11]Rumbaugh, J., Jacobson, I., Booch, G., The Unified Modeling Language Reference
Manual. AddisonWesley, 1999.
[12] Amal Khalil and Juergen Dingel, Supporting the evolution of UML models in model driven software development: A Survey, Technical Report, School of Computing, Queen's University, Canada, Feb 2013.
This paper may be cited as:
Thirugnanam, M. and Subramaniam, S., 2014. An Efficient Design Tool to
Detect Inconsistencies in UML Design Models. International Journal of
Computer Science and Business Informatics, Vol. 9, No. 1, pp. 36-44.
An Integrated Procedure for Resolving
Portfolio Optimization Problems using
Data Envelopment Analysis, Ant
Colony Optimization and Gene
Expression Programming
Chih-Ming Hsu
Minghsin University of Science and Technology
1 Hsin-Hsing Road, Hsin-Fong, Hsinchu 304, Taiwan, ROC
ABSTRACT
The portfolio optimization problem is an important issue in the field of investment/financial
decision-making and is currently receiving considerable attention from both researchers and
practitioners. In this study, an integrated procedure using data envelopment analysis (DEA),
ant colony optimization (ACO) for continuous domains and gene expression programming
(GEP) is proposed. The procedure is evaluated through a case study on investing in stocks
in the semiconductor sub-section of the Taiwan stock market. The potential average six-month return on investment of 13.12% from November 1, 2007 to July 8, 2011 indicates
that the proposed procedure can be considered a feasible and effective tool for making
outstanding investment plans. Moreover, it is a strategy that can help investors make profits
even though the overall stock market suffers a loss. The present study can help an investor
to screen stocks with the most profitable potential rapidly and can automatically determine
the optimal investment proportion of each stock to minimize the investment risk while
satisfying the target return on investment set by the investor. Furthermore, this study addresses the scarcity of discussion in the literature about the timing of buying/selling stocks by providing a set of transaction rules.
Keywords
Portfolio optimization, Data envelopment analysis, Ant colony optimization, Gene
expression programming.
1. INTRODUCTION
Portfolio optimization is a procedure that aims to find the optimal
percentage asset allocation for a finite set of assets, thus giving the highest
return for the least risk. It is an important issue in the field of
investment/financial decision-making and currently receiving considerable
attention from both researchers and practitioners. The first parametric model
applied to the portfolio optimization problem was proposed by Harry M.
Markowitz [1]. This is the Markowitz mean-variance model, which is the
foundation for modern portfolio theory. The non-negativity constraint
makes the standard Markowitz model NP-hard and inhibits an analytic
49. International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 9, No. 1. JANUARY 2014 46
solution. Although quadratic programming can be used to solve the problem
with a reasonably small number of different assets, it becomes much more
difficult if the number of assets is increased or if additional constraints, such
as cardinality constraints, bounding constraints or other real-world
requirements, are introduced.
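For reference, the standard mean-variance model can be stated as follows (the notation is ours, not quoted from [1]): minimize the portfolio variance subject to a target return, the budget constraint and non-negativity:

```latex
\begin{aligned}
\min_{w \in \mathbb{R}^n}\quad & w^{\top} \Sigma\, w \\
\text{s.t.}\quad & \mu^{\top} w \ge R, \\
& \mathbf{1}^{\top} w = 1, \\
& w_i \ge 0, \qquad i = 1, \dots, n,
\end{aligned}
```

where $w$ is the vector of asset weights, $\Sigma$ the covariance matrix of asset returns, $\mu$ the vector of expected returns and $R$ the investor's target return. Adding a cardinality constraint introduces binary selection variables, which destroys the convexity of this quadratic program and motivates the heuristic approaches surveyed next.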
Therefore, various approaches for tackling portfolio optimization problems
using heuristic techniques have been proposed. For example,
Anagnostopoulos and Mamanis [2] formulated portfolio selection as a
tri-objective optimization problem that simultaneously maximizes the
expected return and minimizes both the risk and the number of assets held in
the portfolio. Their model also considered quantity
constraints and class constraints, intended to limit the proportion of the
portfolio invested in assets with common characteristics and to avoid very
small holdings. The experimental results and a comparison revealed that
SPEA2 (strength Pareto evolutionary algorithm 2) [4] is the best algorithm
for both the constrained and unconstrained portfolio optimization problems,
while PESA (Pareto envelope-based selection algorithm) [3] is the runner-
up and the fastest approach of all models compared. Deng and Lin [5]
proposed an approach for resolving the cardinality constrained Markowitz
mean-variance portfolio optimization problem based on the ant colony
optimization (ACO) algorithm. Their proposed method was demonstrated
using test data from the Hang Seng 31, DAX 100, FTSE 100, S&P 100, and
Nikkei 225 indices from March 1992 to September 1997, which yielded
adequate results. Chen et al. [6] proposed a decision-making model of
dynamic portfolio optimization, based on time adapting genetic network
programming (TA-GNP), that adapts to changes in stock prices and generates
portfolio investment advice. Their model determines the distribution of
initial capital to each stock in the portfolio and creates trading rules
for buying and selling stocks on a regular basis, using technical indices
and candlestick charts as judgment functions. The effectiveness and
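As a point of contrast with the cardinality-constrained variant that such heuristics target, the model with only the budget constraint retained does admit a closed-form solution: the global minimum-variance weights are given by w = Σ⁻¹1 / (1ᵀΣ⁻¹1). A minimal sketch in Python, using a hypothetical three-asset covariance matrix (the figures are illustrative, not market data):

```python
import numpy as np

# Hypothetical 3-asset covariance matrix (illustrative figures only).
Sigma = np.array([[0.10, 0.02, 0.01],
                  [0.02, 0.08, 0.03],
                  [0.01, 0.03, 0.12]])

ones = np.ones(3)
x = np.linalg.solve(Sigma, ones)   # x = Sigma^{-1} 1
w = x / (ones @ x)                 # closed-form minimum-variance weights

print(w, w.sum())                  # the weights sum to 1 by construction
```

Once non-negativity, cardinality or bounding constraints are added, no such formula exists, which is precisely why heuristic methods such as ACO, GAs and PSO are brought to bear.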
efficiency of their proposed method was demonstrated by an experiment on
the Japanese stock market. The comparative results showed that TA-GNP
generates more profit than the traditional static GNP, genetic
algorithms (GAs), and the buy-and-hold strategy. Sun et al. [7] modified the
update equations of velocity and position of the particle in particle swarm
optimization (PSO) and proposed the drift particle swarm optimization
(DPSO) to resolve the multi-stage portfolio optimization (MSPO) problem
where transactions take place at discrete time points during the planning
horizon. The authors illustrated their approach by conducting experiments
on the problem with different numbers of stages in the planning horizon
using sample data collected from the S&P 100 index. The experimental
results and a comparison indicated that the DPSO heuristic can yield
superior efficient frontiers compared to PSO, GAs and two classical