ISSN: 1694-2507 (Print)
ISSN: 1694-2108 (Online)
International Journal of Computer Science
and Business Informatics
(IJCSBI.ORG)
VOL 9, NO 1
JANUARY 2014
Table of Contents VOL 9, NO 1 JANUARY 2014
A Predictive Stock Data Analysis with SVM-PCA Model .......................................................................1
Divya Joseph and Vinai George Biju
HOV-kNN: A New Algorithm to Nearest Neighbor Search in Dynamic Space.......................................... 12
Mohammad Reza Abbasifard, Hassan Naderi and Mohadese Mirjalili
A Survey on Mobile Malware: A War without End................................................................................... 23
Sonal Mohite and Prof. R. S. Sonar
An Efficient Design Tool to Detect Inconsistencies in UML Design Models............................................. 36
Mythili Thirugnanam and Sumathy Subramaniam
An Integrated Procedure for Resolving Portfolio Optimization Problems using Data Envelopment
Analysis, Ant Colony Optimization and Gene Expression Programming ................................................. 45
Chih-Ming Hsu
Emerging Technologies: LTE vs. WiMAX ................................................................................................... 66
Mohammad Arifin Rahman Khan and Md. Sadiq Iqbal
Introducing E-Maintenance 2.0 ................................................................................................................. 80
Abdessamad Mouzoune and Saoudi Taibi
Detection of Clones in Digital Images........................................................................................................ 91
Minati Mishra and Flt. Lt. Dr. M. C. Adhikary
The Significance of Genetic Algorithms in Search, Evolution, Optimization and Hybridization: A Short
Review ...................................................................................................................................................... 103
Kunjal Bharatkumar Mankad
International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 9, No. 1. JANUARY 2014 1
A Predictive Stock Data Analysis
with SVM-PCA Model
Divya Joseph
PG Scholar, Department of Computer Science and Engineering
Christ University Faculty of Engineering
Christ University, Kanmanike, Mysore Road, Bangalore - 560060
Vinai George Biju
Asst. Professor, Department of Computer Science and Engineering
Christ University Faculty of Engineering
Christ University, Kanmanike, Mysore Road, Bangalore – 560060
ABSTRACT
In this paper, the properties of Support Vector Machines (SVM) on financial time series
data are analyzed. High-dimensional stock data consist of many features or attributes,
most of which are uninformative for classification. Detecting trends in stock market
data is a difficult task, as the data exhibit complex, nonlinear, dynamic and chaotic
behaviour. To improve forecasting performance on stock data, different models can be
combined so as to capture different data patterns. The performance of a model can also
be improved by using only the informative attributes for prediction; the uninformative
attributes are removed to increase the efficiency of the model. Here, the uninformative
attributes of the stock data are eliminated using a dimensionality reduction technique,
Principal Component Analysis (PCA). The classification accuracy on the stock data is
compared between the model that considers all attributes (SVM without PCA) and the
SVM-PCA model, which uses only the informative attributes.
Keywords
Machine Learning, stock analysis, prediction, support vector machines, principal
component analysis.
1. INTRODUCTION
Time series analysis and prediction is an important task in many fields of
science, with applications such as weather forecasting, electricity demand
forecasting, medical research, financial forecasting, and process monitoring
and control [1][2][3]. Machine learning
techniques are widely used for solving pattern prediction problems. The
financial time series stock prediction is considered a very challenging
task for analysts, investigators and economists [4]. A vast number of past
studies have used artificial neural networks (ANN) and genetic algorithms
for time series data [5]. Many real-time applications use ANNs for
time-series modelling and forecasting [6]. Furthermore,
researchers have hybridized artificial intelligence techniques. Kohara et al. [7]
incorporated prior knowledge to improve the performance of stock market
prediction. Tsaih et al. [8] integrated the rule-based technique and ANN to
predict the direction of the S& P 500 stock index futures on a daily basis.
Some of these studies, however, showed that ANN had some limitations in
learning the patterns because stock market data has tremendous noise and
complex dimensionality [9]. ANN often exhibits inconsistent and
unpredictable performance on noisy data [10]. However, back-propagation
(BP) neural network, the most popular neural network model, suffers from
difficulty in selecting a large number of controlling parameters which
include relevant input variables, hidden layer size, learning rate, and
momentum term [11].
This paper proceeds as follows. Section 2 introduces the concepts of support
vector machines. Section 3 describes principal component analysis.
Section 4 describes the implementation and the model used for prediction of
the stock price index. Section 5 provides the results of the models. Section 6
presents the conclusion.
2. SUPPORT VECTOR MACHINES
Support vector machines (SVMs) are very popular linear discrimination
methods that build on a simple yet powerful idea [12]. Samples are mapped
from the original input space into a high-dimensional feature space, in
which a "best" separating hyperplane can be found. A separating hyperplane
H is best if its margin is largest [13].
The margin is defined as the largest distance between two hyperplanes
parallel to H on both sides that do not contain sample points between them
(we will see later a refinement to this definition) [12]. It follows from the
risk minimization principle (an assessment of the expected loss or error, i.e.,
the misclassification of samples) that the generalization error of the
classifier is better if the margin is larger.
The separating hyperplane that keeps the closest points of the different
classes at maximum distance from it is preferred: the two groups of samples
are then separated by the largest margin, and the classifier is least
sensitive to minor errors in the hyperplane's direction [14].
2.1 Linearly Separable Data
Consider two classes, labeled −1 and +1. Each sample is a pair $\{x^t, r^t\}$,
where $r^t = +1$ if $x^t \in C_1$ and $r^t = -1$ if $x^t \in C_2$. Here
$\mathcal{X}$ denotes the set of $N$ samples, $x^t$ is a $p$-dimensional real
vector, and $r^t$ is the class label (+1 or −1). We want to find $w$ and $w_0$
such that

$$w^T x^t + w_0 \geq +1 \quad \text{for } r^t = +1$$
$$w^T x^t + w_0 \leq -1 \quad \text{for } r^t = -1$$

which can be rewritten as

$$r^t (w^T x^t + w_0) \geq 1 \qquad (1)$$
Here the instances are not only required to be on the correct side of the
hyperplane, but also to be some distance away from it for better generalization.
The distance from the hyperplane to the instances closest to it on either side
is called the margin, which we want to maximize for best generalization.
The optimal separating hyperplane is the one that maximizes the margin.
The distance of an instance $x^t$ to the hyperplane is

$$\frac{|w^T x^t + w_0|}{\|w\|}$$

which, since $r^t \in \{+1, -1\}$, can be written as

$$\frac{r^t (w^T x^t + w_0)}{\|w\|}$$

We require this to be at least some value $\rho$:

$$\frac{r^t (w^T x^t + w_0)}{\|w\|} \geq \rho, \quad \forall t \qquad (2)$$
We would like to maximize $\rho$, but there is an infinite number of solutions
obtained by scaling $w$; to fix the scale, we set $\rho \|w\| = 1$. Thus, to
maximize the margin, we minimize $\|w\|$:

$$\min \frac{1}{2}\|w\|^2 \quad \text{subject to } r^t (w^T x^t + w_0) \geq 1, \ \forall t \qquad (3)$$
Figure 1 The geometry of the margin with the canonical hyperplanes H1 and H2.
The margin is the distance between the separating hyperplane (g(x) = 0) and a
hyperplane through the closest points (marked by a ring around the data
points). The ringed points are the support vectors.
This is a standard optimization problem, whose complexity depends on $d$,
and it can be solved directly to find $w$ and $w_0$. On either side of the
hyperplane, the closest instances lie at distance $\frac{1}{\|w\|}$; counting
both sides, the total margin is $\frac{2}{\|w\|}$.
If the problem is not linearly separable, instead of fitting a nonlinear
function, one trick is to map the problem to a new space by using nonlinear
basis functions. Generally the new space has many more dimensions than the
original space, and in such a case we are interested in a method whose
complexity does not depend on the input dimensionality. To obtain a new
formulation, Eq. (3) is written as an unconstrained problem using Lagrange
multipliers $\alpha^t$:
$$L_p = \frac{1}{2}\|w\|^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t (w^T x^t + w_0) - 1 \right]
     = \frac{1}{2}\|w\|^2 - \sum_{t} \alpha^t r^t (w^T x^t + w_0) + \sum_{t} \alpha^t \qquad (4)$$
This should be minimized with respect to $w$, $w_0$ and maximized with respect
to $\alpha^t \geq 0$; the saddle point gives the solution.
This is a convex quadratic optimization problem because the main term is
convex and the linear constraints are also convex. Therefore, the dual
problem can be solved equivalently by making use of the Karush-Kuhn-Tucker
conditions. The dual is to maximize $L_p$ subject to the constraints that its
gradients with respect to $w$ and $w_0$ are zero and that $\alpha^t \geq 0$.
$$\frac{\partial L_p}{\partial w} = 0 \ \Rightarrow \ w = \sum_{t=1}^{N} \alpha^t r^t x^t \qquad (5)$$

$$\frac{\partial L_p}{\partial w_0} = 0 \ \Rightarrow \ \sum_{t=1}^{N} \alpha^t r^t = 0 \qquad (6)$$
Substituting Eq. (5) and Eq. (6) in Eq. (4), the following is obtained:

$$L_d = \frac{1}{2} w^T w - w^T \sum_{t} \alpha^t r^t x^t - w_0 \sum_{t} \alpha^t r^t + \sum_{t} \alpha^t
     = -\frac{1}{2} \sum_{t} \sum_{s} \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_{t} \alpha^t \qquad (7)$$

which can be maximized with respect to $\alpha^t$ only, subject to the constraints

$$\sum_{t} \alpha^t r^t = 0 \quad \text{and} \quad \alpha^t \geq 0, \ \forall t$$
This can be solved using the quadratic optimization methods. The size of the
dual depends on N, sample size, and not on d, the input dimensionality.
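As a small illustration of why the dual's size depends on N rather than d, the quadratic term of Eq. (7) can be assembled as an N × N matrix even when the inputs have many more attributes (made-up data):

```python
import numpy as np

N, d = 6, 30                      # few samples, many attributes
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d))   # samples x^t, t = 1..N
r = np.array([+1, +1, +1, -1, -1, -1])

# Matrix of the quadratic term in Eq. (7): Q[t, s] = r^t r^s (x^t)^T x^s
Q = (r[:, None] * r[None, :]) * (X @ X.T)
print(Q.shape)   # (6, 6): the dual's size depends on N, not on d
```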
Once the $\alpha^t$ are solved for, only a small percentage have $\alpha^t > 0$;
most vanish with $\alpha^t = 0$. The set of $x^t$ with $\alpha^t > 0$ are the
support vectors, and $w$ is written as a weighted sum of these training
instances. These are the $x^t$ that satisfy $r^t (w^T x^t + w_0) = 1$ and lie
on the margin. This can be used to calculate $w_0$ from any support vector as

$$w_0 = r^t - w^T x^t \qquad (8)$$

For numerical stability, it is advised that this be done for all support
vectors and the average taken. The discriminant thus found is called the
support vector machine (SVM) [1].
3. PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis (PCA) is a powerful tool for dimensionality
reduction. It projects the data onto the directions of greatest variance (the
principal components), so the data can be compressed into fewer dimensions
while the information loss remains comparatively small.
Figure 2 Diagrammatic Representation of Principal Component Analysis (PCA)
4. CASE STUDY
An investor in stocks ideally wants maximum returns on the investment made,
and for that needs to know which stocks will do well in the future; this is
the basic incentive for forecasting stock prices. To do so, the investor has
to study different stocks: their price history, performance, the reputation
of the stock company, and so on. It is therefore a broad area of study. There
exists considerable evidence showing that stock returns are to some extent
predictable. Most of this research is conducted using data from
well-established stock markets such as the US, Western Europe, and Japan. It
is, thus, of interest to study the extent of stock market predictability
using data from less well-established stock markets such as that of India.
Analysts monitor changes in these indicators to guide their trading. As long
as past stock prices and trading volumes are not fully discounted by the
market, technical analysis retains its value for forecasting, and to maximize
profits from the stock market, traders seek ever "better" forecasting
techniques. The research dataset used in this study is from the State Bank of
India; the series spans 10th January 2012 to 18th September 2013. The first
training and testing dataset consists of 30 attributes. The second training
and testing dataset consists of 5 attributes selected with the dimensionality
reduction technique (PCA) in the Weka tool.
Table 1 Number of instances in the case study
State Bank of India Stock Index
Total Number of Instances 400
Training Instances 300
Testing Instances 100
The purpose of this study is to predict the direction of daily change of the
SBI Index. Direction is a categorical variable indicating the movement
direction of the SBI Index at any time t, categorized as "0" or "1" in the
research data: "0" means that the next day's index is lower than today's
index, and "1" means that the next day's index is higher than today's index.
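The direction labels described above can be derived from a closing-price series along these lines (the prices are made-up, not the actual SBI data):

```python
def direction_labels(close):
    """Label day t as "1" if the next day's index is higher, else "0"."""
    return ["1" if close[t + 1] > close[t] else "0"
            for t in range(len(close) - 1)]

prices = [100.0, 101.5, 101.0, 102.2, 102.2]
print(direction_labels(prices))   # ['1', '0', '1', '0']
```

Note that the last day gets no label, since its next-day index is unknown; an unchanged index is labelled "0" here, a convention the paper does not spell out.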
The stock data classification is implemented with Weka 3.7.9. k-fold cross
validation is used for the classification. In k-fold cross-validation, the
original sample is randomly partitioned into k subsamples. Of the k
subsamples, a single subsample is retained as the validation data for testing
the model, and the remaining k − 1 subsamples are used as
training data [15]. The cross-validation parameter k is set to 10 for the
stock dataset [16]. The cross-validation process is then repeated k times
(the folds), with each of the k subsamples used exactly once as the
validation data. The k results from the folds can then be averaged (or
otherwise combined) to produce a single estimate.
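The k-fold partitioning described above can be sketched directly; this is a generic, unstratified version (Weka's internal shuffling and stratification details are omitted):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Partition indices 0..n-1 into k folds; each fold serves once as
    validation while the remaining k-1 folds form the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(k_fold_indices(300, 10))
print(len(splits), len(splits[0][1]))   # 10 folds, 30 validation instances each
```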
Figure 3 Weka Screenshot of PCA
At first the model is trained with SVM and the results on the test data are
saved. Second, the dimensionality reduction technique, PCA, is applied to the
training dataset. PCA selects the attributes which carry the most information
for the stock index classification; the number of attributes for
classification is thus reduced from 30 to 5.
Only the most informative attributes are then considered for classification.
A new SVM model is trained with the reduced attributes, the test data with
reduced attributes is provided to the model, and the result is saved. The
results of both models are compared and analysed.
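One detail worth making explicit in this workflow: the PCA projection is estimated on the training set and then applied unchanged to the test set. A sketch with synthetic data standing in for the Weka pipeline (the SVM step itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.standard_normal((300, 30))   # 300 training instances, 30 attributes
X_test = rng.standard_normal((100, 30))    # 100 test instances

# Fit PCA on the training set only
mean = X_train.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(X_train - mean, rowvar=False))
W = vecs[:, np.argsort(vals)[::-1][:5]]    # top-5 components, 30 x 5

# Apply the same mean and components to both sets
Z_train = (X_train - mean) @ W             # 300 x 5
Z_test = (X_test - mean) @ W               # 100 x 5
print(Z_train.shape, Z_test.shape)
```

Fitting the projection on the test data as well would leak information into the evaluation, which is why the training-set mean and components are reused.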
5. EXPERIMENTAL RESULTS
5.1 Classification without using PCA
As shown in the tables below, 300 stock index instances were used as training
data and 100 stock index instances as test data. On the test data, 43% of
instances were correctly classified and 57% were incorrectly classified.
Table 2 Number of instances for classification without using PCA
Number of Instances and Attributes
Number of Train Instances Number of Test Instances Number of
Attributes
300 100 30
Table 3 Classification accuracy without using PCA
Classification Accuracy
Correctly Classified Instances 43%
Incorrectly Classified Instances 57%
5.2 Classification with PCA
As shown in the tables below, 300 stock index instances were used as training
data and 100 stock index instances as test data. On the test data, 59% of
instances were correctly classified and 41% were incorrectly classified.
Table 4 Number of instances for classification with PCA
Number of Instances and Attributes
Number of Train Instances Number of Test Instances Number of
Attributes
300 100 5
Table 5 Classification accuracy with PCA
Classification Accuracy
Correctly Classified Instances 59%
Incorrectly Classified Instances 41%
6. CONCLUSION
Support Vector Machines can produce accurate and robust classification
results on a sound theoretical basis, even when the input stock data are
non-monotone and not linearly separable. The SVM evaluates the more relevant
information in a convenient way. Principal component analysis is an efficient
dimensionality reduction method which yields a better SVM classification on
the stock data. The SVM-PCA model analyzes the stock data with fewer, more
relevant features. In this way a better picture of the stock data is
obtained, which in turn gives more efficient knowledge extraction on the
stock indices. The stock data were classified better with the SVM-PCA model
than with SVM alone. The SVM-PCA model also reduces the computational cost
drastically. The instances are labelled with nominal values in the current
case study; a future enhancement of this work would be to use numerical
values for labelling instead of nominal values.
7. ACKNOWLEDGMENTS
We express our sincere gratitude to the Computer Science and Engineering
Department of Christ University Faculty of Engineering especially
Prof. K Balachandran for his constant motivation and support.
REFERENCES
[1] Divya Joseph, Vinai George Biju, “A Review of Classifying High Dimensional Data to
Small Subspaces”, Proceedings of International Conference on Business Intelligence at
IIM Bangalore, 2013.
[2] Claudio V. Ribeiro, Ronaldo R. Goldschmidt, Ricardo Choren, A Reuse-based
Environment to Build Ensembles for Time Series Forecasting, Journal of Software,
Vol. 7, No. 11, Pages 2450-2459, 2012.
[3] Dr. A. Chitra, S. Uma, "An Ensemble Model of Multiple Classifiers for Time Series
Prediction", International Journal of Computer Theory and Engineering, Vol. 2, No. 3,
pages 454-458, 2010.
[4] Sundaresh Ramnath, Steve Rock, Philip Shane, "The financial analyst forecasting
literature: A taxonomy with suggestions for further research", International Journal of
Forecasting 24 (2008) 34–75.
[5] Konstantinos Theofilatos, Spiros Likothanassis, Andreas Karathanasopoulos, Modeling
and Trading the EUR/USD Exchange Rate Using Machine Learning Techniques,
ETASR - Engineering, Technology & Applied Science Research Vol. 2, No. 5, pages
269-272, 2012.
[6] G. Peter Zhang, B. Eddy Patuwo, and Michael Y. Hu, "A simulation study of
artificial neural networks for nonlinear time-series forecasting", Computers
& OR 28(4):381-396, 2001.
[7] K. Kohara, T. Ishikawa, Y. Fukuhara, Y. Nakamura, Stock price prediction using prior
knowledge and neural networks, Int. J. Intell. Syst. Accounting Finance Manage. 6 (1)
(1997) 11–22.
[8] R. Tsaih, Y. Hsu, C.C. Lai, Forecasting S&P 500 stock index futures with a hybrid AI
system, Decision Support Syst. 23 (2) (1998) 161–174.
[9] Mahesh Khadka, K. M. George, Nohpill Park, "Performance Analysis of Hybrid
Forecasting Model In Stock Market Forecasting", International Journal of Managing
Information Technology (IJMIT), Vol. 4, No. 3, August 2012.
[10] Kyoung-jae Kim, "Artificial neural networks with evolutionary instance selection for
financial forecasting", Expert Systems with Applications 30, 3 (April 2006), 519-526.
[11]Guoqiang Zhang, B. Eddy Patuwo, Michael Y. Hu, “Forecasting with artificial neural
networks: The state of the art”, International Journal of Forecasting 14 (1998) 35–62.
[12]K. Kim, I. Han, Genetic algorithms approach to feature discretization in artificial
neural networks for the prediction of stock price index, Expert Syst. Appl. 19 (2)
(2000) 125–132.
[13]F. Cai and V. Cherkassky “Generalized SMO algorithm for SVM-based multitask
learning", IEEE Trans. Neural Netw. Learn. Syst., Vol. 23, No. 6, pp.997 -1003, 2012.
[14]Corinna Cortes and Vladimir Vapnik, Support-Vector Networks. Mach. Learn. 20,
Volume 3, 273-297, 1995.
[15]Shivanee Pandey, Rohit Miri, S. R. Tandan, "Diagnosis And Classification Of
Hypothyroid Disease Using Data Mining Techniques", International Journal of
Engineering Research & Technology, Volume 2 - Issue 6, June 2013.
[16]Hui Shen, William J. Welch and Jacqueline M. Hughes-Oliver, "Efficient, Adaptive
Cross-Validation for Tuning and Comparing Models, with Application to Drug
Discovery", The Annals of Applied Statistics 2011, Vol. 5, No. 4, 2668–2687,
February 2012, Institute of Mathematical Statistics.
This paper may be cited as:
Joseph, D. and Biju, V. G., 2014. A Predictive Stock Data Analysis with
SVM-PCA Model. International Journal of Computer Science and Business
Informatics, Vol. 9, No. 1, pp. 1-11.
HOV-kNN: A New Algorithm to
Nearest Neighbor Search in
Dynamic Space
Mohammad Reza Abbasifard
Department of Computer Engineering,
Iran University of Science and Technology,
Tehran, Iran
Hassan Naderi
Department of Computer Engineering,
Iran University of Science and Technology,
Tehran, Iran
Mohadese Mirjalili
Department of Computer Engineering,
Iran University of Science and Technology,
Tehran, Iran
ABSTRACT
Nearest neighbor search is one of the most important problems in computer
science due to its numerous applications. Finding nearest neighbors in a
dynamic space, however, remains difficult, and in contrast to the static
case there are not many works in this new area. In this paper we introduce a
new nearest neighbor search algorithm (called HOV-kNN) suitable for dynamic
spaces, since it eliminates the preprocessing step that is widespread in
static approaches. The basic idea of our algorithm is to eliminate
unnecessary computations in the Higher Order Voronoi Diagram (HOVD) so as to
find nearest neighbors efficiently. The proposed algorithm can report the k
nearest neighbors with time complexity O(kn log n), in contrast to previous
work requiring O(k²n log n). To show its accuracy, we have implemented this
algorithm and evaluated it using an automatically and randomly generated set
of data points.
Keywords
Nearest Neighbor search, Dynamic Space, Higher Order Voronoi Diagram.
1. INTRODUCTION
The Nearest Neighbor search (NNS) is one of the main problems in
computer science with numerous applications such as: pattern recognition,
machine learning, information retrieval and spatio-temporal databases [1-6].
Different approaches and algorithms have been proposed for these diverse
applications. In a well-known categorization, these approaches and
algorithms can be divided into static and dynamic (moving points). The
existing algorithms and approaches can be divided into three categories,
based on whether the query points and/or data objects are moving: (i) static
kNN queries over static objects, (ii) moving kNN queries over static
objects, and (iii) moving kNN queries over moving objects [15].
In the first category, data points as well as query point(s) have stationary
positions [4, 5]. Most of these approaches first index the data points in a
pre-processing step that constructs a specific data structure; different
search algorithms can then be run on the resulting structure to find nearest
neighbors. Unfortunately, the pre-processing step, index construction, has a
high complexity and takes more time than the search step. This cost can be
reasonable when the space is static, because once the data structure is
constructed, multiple queries can be served; in other words, the
pre-processing time is amortized over query executions. In this case, the
search algorithm has logarithmic time complexity. These approaches are
therefore useful when high-velocity query execution over a large volume of
stationary data is required.
Some applications need the answer to a query as soon as the data is
accessible and cannot tolerate the pre-processing execution time. For
example, in a dynamic space where data points are moving, spending such time
to construct a temporary index is illogical. As a result, approaches that
work very well in a static space may be useless in a dynamic one.
In this paper a new method, called HOV-kNN, suitable for finding the k
nearest neighbors in a dynamic environment, is presented. In the k-nearest
neighbor search problem, given a set P of points in a d-dimensional
Euclidean space R^d (P ⊂ R^d) and a query point q (q ∈ R^d), the problem is
to find the k points of P nearest to the given query point q [2, 7]. The
proposed algorithm has a good query execution complexity, O(kn log n),
without suffering from a time-consuming pre-processing step. The approach is
based on the well-known Voronoi diagram (VD) [11]. As an innovation, we have
modified Fortune's algorithm [13] to create order-k Voronoi diagrams, which
are then used for finding the kNN.
The organization of this paper is as follows. The next section gives an
overview of related work. Section 3 presents basic concepts and definitions.
Section 4 explains our new approach, HOV-kNN. Our experimental results are
discussed in section 5. We finish the paper with conclusions and future work
in section 6.
2. RELATED WORKS
Recently, many methods have been proposed for the k-nearest neighbor search
problem. A naive solution to the NNS problem is the linear search method,
which computes the distance from the query to every single point in the
dataset and returns the k closest points. This approach is guaranteed to
find the exact nearest neighbors [6]. However, it can be expensive for
massive datasets, so approximate nearest neighbor search algorithms have
been presented even for static spaces [2].
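The linear-search baseline can be stated compactly; this is a generic sketch, not the authors' implementation:

```python
import numpy as np

def knn_linear(points, q, k):
    """Exact k nearest neighbors of q by scanning every point: O(n)
    distance computations plus a sort, with no preprocessing or index."""
    d = np.linalg.norm(points - q, axis=1)   # distance to every point
    return np.argsort(d)[:k]                 # indices of the k closest

P = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.5, 0.1]])
print(knn_linear(P, np.array([0.0, 0.0]), 2))
```

Because it holds no state, this baseline works unchanged when the points move between queries, which is exactly why it is the fallback in dynamic spaces despite its cost.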
A central component of any NNS approach is the data structure it employs.
Among the different data structures, trees are the most widely used and can
be applied in both static and dynamic spaces. Listing the proposed solutions
to kNN in static space is out of the scope of this paper; the interested
reader can refer to more comprehensive and detailed discussions of this
subject in [4, 5]. To name some of the more important structures, we can
point to the kd-tree, ball-tree, R-tree, R*-tree, B-tree and X-tree
[2-5, 8, 9]. In contrast, a number of papers use graph data structures for
nearest neighbor search. For example, Hajebi et al. performed hill-climbing
on a kNN graph: they built a nearest neighbor graph in an offline phase and
performed a greedy search on it to find the closest node to the query [6].
However, the focus of this paper is on dynamic space. In contrast to static
space, finding nearest neighbors in a dynamic environment is a new topic of
research with relatively limited number of publications. Song and
Roussopoulos have proposed Fixed Upper Bound Algorithm, Lazy Search
Algorithm, Pre-fetching Search Algorithm and Dual Buffer Search to find k-
nearest neighbors for a moving query point in a static space with stationary
data points [8]. Güting et al have presented a filter-and-refine approach to
kNN search problem in a space that both data points and query points are
moving. The filter step traverses the index and creates a stream of so-called
units (linear pieces of a trajectory) as a superset of the units required to build
query’s results. The refinement step processes an ordered stream of units
and determines the pieces of units forming the final precise result
[9].Frentzos et al showed mechanisms to perform NN search on structures
such as R-tree, TB-Tree, 3D-R-Tree for moving objects trajectories. They
used depth-first and best-first algorithms in their method [10].
As mentioned, we use Voronoi diagram [11] to find kNN in a dynamic
space. D.T. Lee used Voronoi diagram to find k nearest neighbor. He
described an algorithm for computing order-k Voronoi diagram in
𝑂(𝑘2
𝑛𝑙𝑜𝑔𝑛) time and 𝑂(𝑘2
(𝑁 − 𝑘)) space [12] which is a sequential
algorithm. Henning Meyerhenke presented and analyzed a parallel
algorithm for constructing HOVD for two parallel models: PRAM and CGM
[14]. In these models he used Lee’s iterative approach but his model stake
𝑂
𝑘2(𝑛−𝑘)𝑙𝑜𝑔𝑛
𝑝
running time and 𝑂(𝑘) communication rounds on a CGM
International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 9, No. 1. JANUARY 2014 15
with 𝑂(
𝑘2(𝑁−𝑘)
𝑝
) local memory per processor [14]. p is the number of
participant machines.
3. BASIC CONCEPTS AND DEFINITIONS
Let P be a set of n sites (points) in the Euclidean plane. Informally, the
Voronoi diagram is a subdivision of the plane into cells (Figure 1) such
that all points of a cell have the same closest site [11].
Figure 1. Voronoi Diagram
The Euclidean distance between two points p and q is denoted by dist(p, q):

dist(p, q) := √((p_x − q_x)² + (p_y − q_y)²)   (1)
Definition (Voronoi diagram): Let P = {p_1, p_2, …, p_n} be a set of n
distinct points (so-called sites) in the plane. The Voronoi diagram of P is
defined as the subdivision of the plane into n cells, one for each site in
P, with the property that a point q lies in the cell corresponding to site
p_i if dist(q, p_i) < dist(q, p_j) for each p_j ∈ P with j ≠ i [11].
Historically, O(n²) incremental algorithms for computing the VD were known
for many years. Then an O(n log n) algorithm based on divide and conquer was
introduced, but it was complex and difficult to understand. Steven Fortune
[13] later proposed a plane-sweep algorithm which provides a simpler
O(n log n) solution to the problem.
Instead of partitioning the space into regions according to the closest
site, one can also partition it according to the k closest sites, for some
1 ≤ k ≤ n − 1. The diagrams obtained in this way are called higher-order
Voronoi diagrams (HOVD), and for a given k, the diagram is called the
order-k Voronoi diagram [11]. Note that the order-1 Voronoi diagram is
nothing more than the standard VD. The order-(n−1) Voronoi diagram is the
farthest-point Voronoi diagram (given a set P of points in the plane, a
point of P has a cell in the farthest-point VD iff it is a vertex of the
convex hull), because the Voronoi cell of a point p_i is now the region of
points for which p_i is the farthest site. Currently the best known
algorithms for computing the order-k Voronoi diagram run in
O(n log³ n + nk) time and in O(n log n + nk·2^(c log* k)) time, where c is a
constant [11].
Figure 2. Farthest-Point Voronoi diagram [11]
Consider x and y as two distinct elements of P. The set of points whose
nearest and second-nearest neighbors are x and y constructs a cell in the
second-order Voronoi diagram. The second-order Voronoi diagram can thus be
used when we are interested in the two closest points and want a diagram
that captures that.
Figure 3. An instance of HOVD [11]
4. SUGGESTED ALGORITHM
As mentioned before, one of the best algorithms to construct the Voronoi
diagram is Fortune's algorithm. Furthermore, the HOVD can be used to find
k nearest neighbors [12]. D.T. Lee used an O(k²n log n) algorithm to
construct a complete HOVD to obtain nearest neighbors. In D.T. Lee's
algorithm, the first-order Voronoi diagram is computed first, and then the
region of the diagram that contains the query point is found. The point in
this region is the first neighbor of the query point. In the next step of
Lee's algorithm, this nearest point to the query is omitted from the
dataset and the process is repeated; in other words, the Voronoi diagram
is rebuilt on the remaining points. In the second repetition of this
process, the second neighbor is found, and so on. Thus, the neighbors of a
given query point are found sequentially, from nearest outward.
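Lee's sequential scheme can be sketched as follows; here a brute-force nearest-point search stands in for the diagram construction and region location, and all names are illustrative:

```python
import math

def lee_style_knn(points, q, k):
    """Sequentially find k nearest neighbors by repeatedly locating the
    nearest remaining point and removing it, mimicking Lee's repeated
    first-order Voronoi constructions (brute force used here in place of
    an actual diagram)."""
    remaining = list(points)
    neighbors = []
    for _ in range(k):
        nearest = min(remaining, key=lambda p: math.dist(q, p))
        neighbors.append(nearest)
        remaining.remove(nearest)  # "rebuild the diagram" on the rest
    return neighbors

pts = [(1, 0), (3, 0), (0, 2), (5, 5)]
print(lee_style_knn(pts, (0, 0), 2))  # [(1, 0), (0, 2)]
```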
International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 9, No. 1. JANUARY 2014 17
However, we argue that the nearest neighbors can be found without
completing the HOVD construction process. More precisely, in Lee's
algorithm, each time after omitting a nearest neighbor, the next order of
the Voronoi diagram is built completely (edges and vertices) and only then
is the search algorithm performed to compute a neighbor. In contrast, in
our algorithm only the vertices of the Voronoi diagram are computed, and
the neighbors of the query are found during the vertex computation. Thus,
in our algorithm, the overhead of edge computation is effectively
eliminated. As we will show later in this paper, eliminating this
superfluous computation yields a more efficient algorithm in terms of time
complexity.
We use Fortune's algorithm to create the Voronoi diagram. Because of space
limitations we do not describe this algorithm here; readers may refer to
[11, 13]. As the sweep line moves in Fortune's algorithm, two sets of
events emerge: site events and circle events [11]. To find the k nearest
neighbors, our algorithm employs the circle events. Some circle events in
the algorithm are not actual circle events; these are called false-alarm
circle events. Our algorithm (see the next section) deals efficiently with
real circle events and does not superfluously consider false-alarm circle
events. A point in the plane is inside a circle when its distance from the
center of the circle is less than the radius of the circle. The vertices
of a Voronoi diagram are the centers of the circles through triples of
sites, i.e., the circumcenters of the triangles those sites form. The main
purpose of our algorithm is to find a circle in which the desired query is
located.
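The circle test above can be made concrete. In this hypothetical sketch (the function names are ours), the circumcenter of three sites is a Voronoi vertex, and a query lies inside the corresponding circle event's circle when its distance to that center is below the radius:

```python
import math

def circumcircle(p1, p2, p3):
    """Center and radius of the circle through three sites; the center is
    a Voronoi vertex of the three sites."""
    ax, ay = p1; bx, by = p2; cx, cy = p3
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    center = (ux, uy)
    return center, math.dist(center, p1)

def query_inside(q, p1, p2, p3):
    """True if q lies inside the circle defined by a circle event."""
    o, r = circumcircle(p1, p2, p3)
    return math.dist(q, o) < r

# Sites (0,0), (2,0), (0,2) give center (1,1) and radius sqrt(2).
print(query_inside((1, 1), (0, 0), (2, 0), (0, 2)))  # True
print(query_inside((5, 5), (0, 0), (2, 0), (0, 2)))  # False
```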
Since the proposed algorithm requires no pre-processing, it is well suited
to dynamic environments that cannot tolerate time-consuming pre-processing
overheads. In most k-NN search methods, a large share of the time is spent
constructing a data structure (usually in the form of a tree). The
algorithm is efficient especially when there are a large number of points
and their motion is considerable.
4.1 HOV-kNN algorithm
Having briefly described our algorithm above, we now elaborate it formally.
While the first-order Voronoi diagram is being constructed, some of the
query's neighbors can already be obtained within the complexity of
Fortune's algorithm (i.e., O(n log n)). This fact forms the first step of
our algorithm. When the circle event discovered in HandleCircleEvent of
Fortune's algorithm is real (indicated by the variable "check" in line 6
of the algorithm; by default, the function HandleCircleEvent returns
"true" when the circle event is real), the query's distance from the
center of the circle is measured. Moreover, when the condition in line 7.i
of the algorithm is true, the three points that constitute the circle are
added to the NEARS list if they have not been added
before (the function PUSH-TAG(p) indicates whether p has already been
added to the NEARS list).
1) Input: q, a query point
2) Output: list NEARS, the k nearest neighbors
3) Procedure:
4) Initialization:
5) NEARS = {} (k nearest neighbors), Check = false, MOD = 0, V = {} (holds Voronoi vertices);
6) Check = HandleCircleEvent()
7) If Check = true, then -- detect a true circle event
   i) If distance(q, o) < r, then
      (1) If PUSH-TAG(p1) = false, then
          (a) add p1 to NEARS
      (2) If PUSH-TAG(p2) = false, then
          (a) add p2 to NEARS
      (3) If PUSH-TAG(p3) = false, then
          (a) add p3 to NEARS
Real circle events are discovered up to this point, and the points that
constitute those events are added to the query's neighbor list. As pointed
out earlier, if the input k is less than or equal to the number of
neighbors obtained, the desired result is achieved with O(n log n)
complexity.
8) If SIZE(NEARS) >= k, then
   a. sort(NEARS) -- sort NEARS by distance to q
   b. for i = 1 to k
      i. print(NEARS[i]);
9) Else if SIZE(NEARS) = k, then
   a. print(NEARS);
The algorithm enters the second step if the conditions of lines 8 and 9 in
the first part are not met. The second part computes the vertices of the
Voronoi diagram sequentially, so that the obtained vertices are HOV
vertices. Under the sequential method for developing the HOV [12], the
vertices of the HOV are obtained by omitting the closer neighbors. Here,
however, to find more neighbors with the sequential method, one of the
closest neighbors and one of the farthest neighbors are deleted alternately
from the point set in each loop iteration. This leads to new circles that
encompass the query. Afterward, the same calculations described in part
one are carried out for the remaining points (the removed neighbors are
recorded in a list named REMOVED_POINTS). The calculations are repeated
until the loop condition (line 13) is met.
10) Else if SIZE(NEARS) < k
    c. if MOD mod 2 = 0, then
       i. add nearest_Point to REMOVED_POINTS;
       ii. Remove(P, nearest_Point);
    d. if MOD mod 2 = 1, then
       i. add farthest_Point to REMOVED_POINTS;
       ii. Remove(P, farthest_Point);
11) Increment MOD;
12) Repeat lines 6 to 9 from part 1 for the remaining points P;
13) Repeat until k >= SIZE_LIST(NEARS) + SIZE_LIST(REMOVED_POINTS);
14) PRINT(NEARS);
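The control flow of this second part can be sketched roughly as follows; brute-force distance computations stand in for the circle-event machinery, and the identifiers are illustrative rather than taken from the paper's implementation:

```python
import math

def alternating_removal(points, q, k):
    """Part-2 control flow: alternately remove the nearest and the
    farthest remaining point; each removed nearest point counts as a
    discovered neighbor, until k candidates are collected."""
    P = list(points)
    removed_nearest, mod = [], 0
    while len(removed_nearest) < k and P:
        if mod % 2 == 0:                            # even iteration: nearest
            p = min(P, key=lambda s: math.dist(q, s))
            removed_nearest.append(p)               # a discovered neighbor
        else:                                       # odd iteration: farthest
            p = max(P, key=lambda s: math.dist(q, s))
        P.remove(p)
        mod += 1
    return removed_nearest

pts = [(1, 0), (2, 0), (3, 0), (9, 0), (10, 0)]
print(alternating_removal(pts, (0, 0), 2))  # [(1, 0), (2, 0)]
```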
Should the number of neighbors found still be less than the required
number, the algorithm enters the third part. In this part, the Voronoi
vertices and their distances from the query are recorded in a list. As
explained for the first part of the algorithm, the Voronoi vertices
produced by Fortune's algorithm and their distances to the query are
enough to check the condition of line 8. The vertices and their distances
to the query are recorded; the following line is added after line 7 in the
first part:
add pair(Voronoi_Vertex, distance_To_Query) to list V
Moreover, along with adding input points to the neighbor list, their
distances to the query must be added to the list.
Using these two lists (once filled, they can be sorted by distance to the
query), the nearest point or Voronoi vertex is obtained. This nearest
point can then be considered as the input query, and the whole process of
the first and second parts of the algorithm is repeated until the required
number of neighbors is achieved. Finally, to obtain even more neighbors,
the method can be applied sequentially over the points closer to the
query. This part of the algorithm has the same complexity as the two other
parts, since the whole process performed for the original query is
repeated for its representatives.
Figure 4. Implementation of HOVD
In Figure 4, "o" is a Voronoi vertex and the center of a circle event
created by p1, p2 and p3. Based on the algorithm, since this circle
encompasses the query, the points p1, p2 and p3 are added to the query's
neighbor list. When k is close to n, computing higher orders of the
Voronoi diagram makes the circles bigger and bigger, so farther neighbors
are added to the query's neighbor list.
4.2 The complexity of HOV-kNN
As mentioned before, the HOV-kNN algorithm has a lower time complexity
than D.T. Lee's algorithm. To see this, consider the algorithm presented
in the previous section. Line 13 states that the main body of the
algorithm must be repeated k times, where k is the number of neighbors to
be found. In each repetition, one of the query's neighbors is detected by
the algorithm and subsequently eliminated from the dataset. The principal,
and most time-consuming, part of our algorithm lies between lines 6 and 9.
These lines invoke the modified Fortune algorithm, which has time
complexity O(n log n). Therefore the overall complexity of our algorithm
is:
∑_{i=1}^{k} O(n log n) = O(n log n) · ∑_{i=1}^{k} 1 = k · O(n log n) = O(kn log n)    (2)
In comparison with the algorithm introduced in [12] (which has time
complexity O(k²n log n)), our algorithm is k times faster. The main reason
for this difference is that Lee's algorithm computes the HOVD completely,
while ours exploits only a fraction of the HOVD construction process. In
terms of space, the complexity of our algorithm is the same as that of
Fortune's algorithm: O(n).
5. IMPLEMENTATION AND EVALUATION
This section presents the results of the HOV-kNN algorithm and compares
them with other algorithms. We use the Voronoi diagram to find the k
nearest neighbor points with less complexity. The proposed algorithm was
implemented in C++. For maintaining the data points, the vector data
structure from the C++ standard library was used. The input data points
used in the program tests were generated randomly. To reach the preferred
data distribution (points neither too close nor too far apart), they were
generated under specific conditions. For instance, for 100 input points
the point generation range is 0-100, and for 500 input points the range is
0-500. To ensure accuracy and validity of the output, a simple kNN
algorithm was implemented and the outputs of the two algorithms were
compared (equal input, equal query). Output evaluation was also carried
out sequentially, and the outputs were stored in two separate files.
Afterward, to compare the similarity rate, the two files were used as
input to another program.
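The validation step, comparing against a simple kNN on equal input and equal query, can be sketched as follows; the names and the similarity measure are illustrative:

```python
import math

def brute_force_knn(points, q, k):
    """Reference kNN used to validate the algorithm's output."""
    return sorted(points, key=lambda p: math.dist(q, p))[:k]

def similarity_rate(result_a, result_b):
    """Fraction of neighbors on which the two outputs agree."""
    common = set(result_a) & set(result_b)
    return len(common) / max(len(result_a), 1)

pts = [(1, 1), (2, 2), (5, 5), (0, 3)]
ref = brute_force_knn(pts, (0, 0), 2)
print(ref)                         # [(1, 1), (2, 2)]
print(similarity_rate(ref, ref))   # 1.0
```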
The evaluation was conducted in two steps. First, the parameter k was held
constant and the evaluation was performed using different numbers of data
points as input. As pictured in Figure 5, the accuracy of the algorithm is
more than 90%. In this diagram, the number of inputs in the dataset varies
between 10 and 100,000. In the second step, the evaluation was conducted
with different values of k while the number of input data points was held
constant. The accuracy obtained was 74% for k between 10 and 500 (Figure 6).
Figure 5. The accuracy of the algorithm for constant k and different points of data as input
Figure 6. The accuracy of the algorithm for variable k and constant data as input
6. CONCLUSION AND FUTURE WORK
We have introduced a new algorithm (named HOV-kNN) with time complexity
O(kn log n) that computes the order-k Voronoi diagram to find the k
nearest neighbors in a set of n points in Euclidean space. The proposed
algorithm finds the k nearest neighbors in two stages: 1) during
construction of the first-order Voronoi diagram, some of the query's
neighbors are obtained within the complexity of Fortune's algorithm; 2)
the vertices of the Voronoi diagram are then computed sequentially.
Because pre-processing steps are eliminated, this algorithm is
particularly suitable for dynamic spaces in which data points are moving.
The experiments were twofold: 1) a constant number of data points with
variable k, and 2) a variable number of data points with constant k. The
obtained results show that the algorithm has sufficient accuracy to be
applied in real situations. In future work we will try to develop a
parallel version of our algorithm that can be implemented efficiently on a
parallel machine to obtain higher speed. Such an algorithm will be
appropriate when the number of input points is massive and probably
distributed over a network of computers.
REFERENCES
[1] Lifshits, Y. Nearest neighbor search: algorithmic perspective, SIGSPATIAL Special, Vol. 2, No. 2, 2010, 12-15.
[2] Shakhnarovich, G., Darrell, T., and Indyk, P. Nearest Neighbor Methods in Learning and Vision: Theory and Practice, The MIT Press, United States, 2005.
[3] Andoni, A. Nearest Neighbor Search - the Old, the New, and the Impossible, PhD thesis, Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2009.
[4] Bhatia, N., and Ashev, V. Survey of Nearest Neighbor Techniques, International Journal of Computer Science and Information Security, Vol. 8, No. 2, 2010, 1-4.
[5] Dhanabal, S., and Chandramathi, S. A Review of various k-Nearest Neighbor Query Processing Techniques, Computer Applications, Vol. 31, No. 7, 2011, 14-22.
[6] Hajebi, K., Abbasi-Yadkori, Y., Shahbazi, H., and Zhang, H. Fast approximate nearest-neighbor search with k-nearest neighbor graph, In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Vol. 2 (IJCAI'11), Toby Walsh (Ed.), 2011, 1312-1317.
[7] Fukunaga, K., and Narendra, P. M. A Branch and Bound Algorithm for Computing k-Nearest Neighbors, IEEE Transactions on Computers, Vol. 24, No. 7, 1975, 750-753.
[8] Song, Z., and Roussopoulos, N. K-Nearest Neighbor Search for Moving Query Point, In Proceedings of the 7th International Symposium on Advances in Spatial and Temporal Databases (Redondo Beach, California, USA), Springer-Verlag, 2001, 79-96.
[9] Güting, R., Behr, T., and Xu, J. Efficient k-Nearest Neighbor Search on moving object trajectories, The VLDB Journal, Vol. 19, No. 5, 2010, 687-714.
[10] Frentzos, E., Gratsias, K., Pelekis, N., and Theodoridis, Y. Algorithms for Nearest Neighbor Search on Moving Object Trajectories, Geoinformatica, Vol. 11, No. 2, 2007, 159-193.
[11] de Berg, M., Cheong, O., van Kreveld, M., and Overmars, M. Computational Geometry: Algorithms and Applications, Third Edition, Springer-Verlag, 2008.
[12] Lee, D. T. On k-Nearest Neighbor Voronoi Diagrams in the Plane, IEEE Transactions on Computers, Vol. C-31, No. 6, 1982, 478-487.
[13] Fortune, S. A sweep line algorithm for Voronoi diagrams, Proceedings of the Second Annual Symposium on Computational Geometry, Yorktown Heights, New York, United States, 1986, 313-322.
[14] Meyerhenke, H. Constructing Higher-Order Voronoi Diagrams in Parallel, Proceedings of the 21st European Workshop on Computational Geometry, Eindhoven, The Netherlands, 2005, 123-126.
[15] Gao, Y., Zheng, B., Chen, G., and Li, Q. Algorithms for constrained k-nearest neighbor queries over moving object trajectories, Geoinformatica, Vol. 14, No. 2 (April 2010), 241-276.
This paper may be cited as:
Abbasifard, M. R., Naderi, H. and Mirjalili, M., 2014. HOV-kNN: A New
Algorithm to Nearest Neighbor Search in Dynamic Space. International
Journal of Computer Science and Business Informatics, Vol. 9, No. 1, pp.
12-22.
A Survey on Mobile Malware:
A War without End
Sonal Mohite
Sinhgad College of Engineering,
Vadgaon. Pune, India.
Prof. R. S. Sonar
Associate Professor
Sinhgad College of Engineering,
Vadgaon. Pune, India.
ABSTRACT
Nowadays, mobile devices have become an inseparable part of our everyday
lives, and their usage has grown exponentially. With the functionality
upgrades of mobile phones, the malware threat for mobile phones is expected
to increase. This paper sheds light on when and how mobile malware evolved.
The current scenario of mobile operating system shares and the number and
types of mobile malware are also described. Mobile malware can be
propagated via three communication media, viz. SMS/MMS, Bluetooth/Wi-Fi and
FM-RDS. Several mobile malware detection techniques are explained with
implemented examples, and when to use each detection technique is
clarified along with its pros and cons. Typically, static analysis of an
application is done first, followed by dynamic analysis; if ample external
resources are available, cloud-based analysis is chosen. Application
permission analysis and battery life monitoring are novel approaches to
malware detection. Along with malware detection, preventing mobile malware
has become critical. Proactive and reactive techniques of mobile malware
control are defined and explained, and a few tips are provided to restrain
malware propagation. Ultimately, a structured and comprehensive overview
of the research on mobile malware is presented.
Keywords
Mobile malware, malware propagation, malware control, malware detection.
1. INTRODUCTION
Decades ago, computers were the only traditional devices used for
computing. Nowadays, smartphones are used as supporting computing devices
alongside computers. With the increasing capabilities of such phones,
malware, once the biggest threat to computers, has become widespread on
smartphones too. The damage done by mobile malware includes theft of
confidential data from the device, eavesdropping on ongoing conversations
by a third party, incurring extra charges through SMS sent to premium-rate
numbers, and even location-based tracking of the user, which is too severe
to overlook. There is therefore a genuine need to understand the
propagation means of mobile malware, the various techniques to detect it,
and how to restrain it.
2. RELATED WORKS
Malware is a malicious piece of software designed to damage a computer
system and interrupt its typical working. Fundamentally, "malware" is
short for "malicious software". Mobile malware is malicious software
targeting mobile phones instead of traditional computer systems. With the
evolution of mobile phones, mobile malware started its evolution too [1-4].
When the propagation medium is taken into account, mobile viruses are of
three types: Bluetooth-based, SMS-based, and FM-RDS-based [5-9]. A
BT-based virus propagates through Bluetooth and Wi-Fi and has a regional
impact [5], [7], [8]. By contrast, an SMS-based virus follows a long-range
spreading pattern and can be propagated through SMS and MMS [5], [6], [8].
An FM-RDS-based virus uses the RDS channel of an FM radio transmitter for
propagation [9]. Our work addresses the effect of the operational behavior
of users and the mobility of devices on virus propagation.
There are several methods of malware detection, viz. static analysis,
dynamic analysis, cloud-based detection, battery life monitoring,
application permission analysis, enforcing a hardware sandbox, etc.
[10-18]. In addition to the work given in [10-18], our work addresses the
pros and cons of each malware detection method. Along with the study of
virus propagation and detection mechanisms, methods of restraining virus
propagation are also vital. A number of proactive and reactive malware
control strategies are given in [5], [10].
3. EVOLUTION OF MOBILE MALWARE
Although the first mobile malware, "Liberty Crack", was developed in 2000,
mobile malware evolved rapidly during the years 2004 to 2006 [1]. An
enormous variety of malicious programs targeting mobile devices evolved
during this period and are still evolving today. These programs resembled
the malware that targeted traditional computer systems: viruses, worms,
and Trojans, the latter including spyware, backdoors, and adware.
At the end of 2012, there were 46,445 mobile malware modifications. By the
end of June 2013, however, Kaspersky Lab had added an aggregate total of
100,386 mobile malware modifications to its system [2], and the total
number of mobile malware samples at the end of December 2013 was 148,778
[4]. Moreover, Kaspersky Lab [4] has collected 8,260,509 unique malware
installation packs. This shows a dramatic increase in mobile malware. The
arrival of "Cabir", the second mobile malware (a worm), developed in 2004
for Symbian OS, confirmed the basic rule of computer virus evolution.
Three conditions must be fulfilled for malicious programs to target any
particular operating system or platform:
 The platform must be popular: During the evolution of "Cabir", Symbian
was the most popular platform for smartphones. Nowadays, however, it is
Android that is most targeted by attackers. Malware authors continue to
concentrate on the Android platform as it holds 93.94% of the total market
share of mobile phones and tablet devices.
 There must be well-documented development tools for the platform:
Nowadays, every mobile operating system developer provides a software
development kit and precise documentation, which enables easy application
development.
 The presence of vulnerabilities or coding errors: During the evolution
of "Cabir", Symbian had a number of loopholes, which was the reason for
malware intrusion. In this day and age, the same is true of Android [3].
The share of an operating system plays a crucial role in mobile malware
development: the higher the market share of an operating system, the
higher the possibility of malware infection. The pie chart below
illustrates the operating-system-wise distribution of mobile malware [4]:
Figure 1. OS wise malware distribution
4. MOBILE MALWARE PROPAGATION
There are three communication channels through which malware can
propagate: SMS/MMS, Bluetooth/Wi-Fi, and FM radio broadcasts.
4.1 SMS / MMS
Viruses that use SMS as a communication medium can send copies of
themselves to all phones recorded in the victim's address book. The virus
can spread by means of forwarded photos, videos, short text messages, etc.
Propagation follows a long-range spreading pattern analogous to the
spreading of computer viruses, such as worm propagation in e-mail networks
[6]. For an accurate study of SMS-based virus propagation, one needs to
consider certain operational patterns, such as whether or not users open a
virus attachment. Hence, the operational behavior of users plays a vital
role in SMS-based virus propagation [8].
4.1.1 Process of malware propagation
If a phone is infected with an SMS-based virus, the virus regularly sends
copies of itself to the phones whose numbers appear in the contact list of
the infected phone. After receiving such a suspicious message, a user may
open or delete it depending on his or her alertness; if the user opens the
message, the phone is infected. However, if a phone is immunized with
antivirus software, a newly arrived virus will not propagate even if the
user opens an infected message. Therefore, the security awareness of
mobile users plays a key role in SMS-based virus propagation.
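This propagation process can be illustrated with a toy simulation over a contact graph; the model and all names are our simplification, not the authors' simulator:

```python
import random

def simulate_sms_virus(contacts, seed, open_prob, immunized, steps, rng):
    """Toy SMS-virus spread: each step, infected phones message every
    contact; a message infects its receiver only if the phone is not
    immunized and the user opens it (probability open_prob)."""
    infected = {seed}
    for _ in range(steps):
        new = set()
        for phone in infected:
            for friend in contacts.get(phone, []):
                if (friend not in infected and friend not in immunized
                        and rng.random() < open_prob):
                    new.add(friend)
        infected |= new
    return infected

contacts = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
rng = random.Random(1)
result = simulate_sms_virus(contacts, "A", 1.0, {"C"}, 3, rng)
print(sorted(result))  # ['A', 'B', 'D']  (C is immunized)
```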
The same process applies to MMS-based virus propagation, except that MMS
carries a more sophisticated payload than SMS: it can carry video and
audio in addition to SMS's simple text and pictures.
4.2 Bluetooth/ Wi-Fi
Viruses that use Bluetooth as a communication channel are
local-contact-driven viruses, since they infect other phones within their
short radio range. A BT-based virus infects individuals in the
neighborhood of the sender, each of whom has an equal probability of
contact with the others [7]. Mobility characteristics of users, such as
whether or not a user moves in a given hour, the probability of returning
to visited places, and the distance a user travels next, need to be
considered [8].
4.2.1 Process of malware propagation
Unlike SMS-based viruses, if a phone is infected by a BT-based virus, it
spontaneously and automatically searches for other phones through
available Bluetooth services, and the virus replicates within the radio
range of the sending device. For that reason, users' mobility patterns and contact
frequency among mobile phones play crucial roles in BT-based virus
propagation. The same process applies to Wi-Fi, which can carry a higher
payload over a larger range than Bluetooth.
4.3 FM-RDS
Several existing electronic devices do not support data connectivity but
include an FM radio receiver; examples are low-end mobile phones, media
players, and vehicular audio systems. FM provides the FM Radio Data System
(RDS), a low-rate digital broadcast channel. It was designed for
delivering simple information about the station and the current program,
but it can also be used by a broad range of new applications and to
enhance existing ones [9].
4.3.1 Process of malware propagation
The attacker can attack in two different ways. The first is to create a
seemingly benign app and upload it to popular app stores. Once the user
downloads and installs the app, it contacts an update server and updates
its functionality. This newly added malicious functionality decodes and
assembles the payload; finally, the assembled payload is executed by the
Trojan app to escalate privileges on the attacked device and use it for
malicious purposes. The other way is for the attacker to obtain a
privilege escalation exploit for the desired target. As the RDS protocol
has limited bandwidth, the exploit must be packetized. Packetization
essentially breaks up a multi-kilobyte binary payload into several smaller
Base64-encoded packets; sequence numbers are attached for proper reception
of the data at the receiver side. The received exploit is then executed,
and in this way the device is infected with malware [9].
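A rough sketch of such packetization follows (illustrative only; the chunk size and framing are our assumptions, not the RDS protocol's actual format):

```python
import base64

def packetize(payload: bytes, chunk_size: int = 48):
    """Split a binary payload into small Base64-encoded packets with
    sequence numbers, as a low-bandwidth channel such as RDS would need."""
    packets = []
    for seq, start in enumerate(range(0, len(payload), chunk_size)):
        chunk = payload[start:start + chunk_size]
        packets.append((seq, base64.b64encode(chunk).decode("ascii")))
    return packets

def reassemble(packets):
    """Receiver side: order packets by sequence number and decode."""
    ordered = sorted(packets)  # sequence numbers restore the order
    return b"".join(base64.b64decode(data) for _, data in ordered)

payload = bytes(range(200))
pkts = packetize(payload)
print(len(pkts))  # 5
```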
5. MOBILE MALWARE DETECTION TECHNIQUE
Once malware has propagated, malware detection must be carried out. In
this section, various mobile malware detection techniques are explained.
5.1 Static Analysis Technique
As the name indicates, static analysis evaluates the application without
executing it [10-11]. It is an economical and fast approach to detecting
malevolent characteristics in an application without running it. Static
analysis can be used for the static pre-checks performed before an
application is admitted to an online application market. Such application
markets exist for most major smartphone platforms, e.g. "Play Store" for
Android and "Store" for the Windows operating system. These extended
pre-checks enhance the malware detection probability, and therefore
further spreading of malware through the online application stores can be
prevented. In static analysis, the application is investigated for
apparent security threats such as memory corruption flaws, bad code
segments, etc. [10], [12].
5.1.1 Process of malware detection
If the source code of the application is available, static analysis tools
can be used directly for further examination of the code. If the source
code is not available, the executable app is converted back to source
code; this process is known as disassembling. Once the application is
disassembled, feature extraction is done. Feature extraction means
observing certain parameters, viz. system calls, data flow, control flow,
etc. Depending on the observations, anomalies are detected, and the
application is categorized as either benign or malicious.
Pros: An economical and fast approach to malware detection.
Cons: The source code of an application is not readily available, and
disassembling might not recover the exact source code.
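As a toy illustration of feature extraction on recovered source, one might scan for suspicious API calls; the watchlist and verdict rule here are purely illustrative and far simpler than real static analyzers:

```python
import re

# Illustrative watchlist; real static analyzers use far richer features
# (control flow, data flow, structural and semantic analysis).
SUSPICIOUS_CALLS = ["sendTextMessage", "getDeviceId", "Runtime.exec"]

def static_scan(source_code: str):
    """Count occurrences of suspicious API calls in recovered source and
    flag the app when any are present."""
    hits = {call: len(re.findall(re.escape(call), source_code))
            for call in SUSPICIOUS_CALLS}
    verdict = "malicious" if any(hits.values()) else "benign"
    return verdict, hits

code = "smsManager.sendTextMessage(premiumNumber, null, body, null, null);"
print(static_scan(code)[0])  # malicious
```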
Figure 2. Static Analysis Technique
5.1.2 Example
Figure 2 shows the malware detection technique proposed by Enck et al.
[12] for Android. The application's installation image (.apk) is used as
input to the system. Ded, a Dalvik decompiler, is used to decompile the
code. It
generates Java source code from the .apk image. Feature extraction is done
using Fortify SCA, a static code analysis suite that provides four types
of analysis: control flow analysis, data flow analysis, structural
analysis, and semantic analysis. It is used to evaluate the recovered
source code and categorize the application as either benign or malicious.
5.2 Dynamic Analysis Technique
Dynamic analysis comprises analyzing the actions performed by an
application while it is being executed. The mobile application is executed
in an isolated environment, such as a virtual machine or emulator, and the
dynamic behavior of the application is monitored [10], [11], [13]. There
are various methodologies for performing dynamic analysis, viz. function
call monitoring, function parameter analysis, information flow tracking,
instruction tracing, etc. [13].
5.2.1 Process of malware detection
The dynamic analysis process is quite different from static analysis. The
application is installed in a standard emulator; after installation, the
app is executed for a specific time and fed with random user inputs. Using
the various methodologies mentioned in [13], the application is examined,
and based on its runtime behavior it is classified as either benign or
malicious.
Pros: A comprehensive approach to malware detection; most malware is
detected by this technique.
Cons: Comparatively complex and requires more resources.
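As a toy stand-in for runtime monitoring, one can record the functions an application invokes while it executes; this sketch uses Python's tracing hook, and the function names are hypothetical:

```python
import sys

def trace_calls(func, *args):
    """Run a function and record the names of all functions it calls, a
    toy stand-in for system-call logging during dynamic analysis."""
    observed = []
    def tracer(frame, event, arg):
        if event == "call":                       # a new frame was entered
            observed.append(frame.f_code.co_name)
        return None
    old = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(old)                         # restore any prior tracer
    return observed

def send_premium_sms():   # hypothetical malicious action
    pass

def sample_app():
    send_premium_sms()

calls = trace_calls(sample_app)
print("send_premium_sms" in calls)  # True
```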
Figure 3. Dynamic Analysis Technique
5.2.2 Example
Figure 3 shows the Android Application Sandbox (AASandbox) [14], the
dynamic malware detection technique proposed by Blasing et al. for
Android. It is a two-step analysis process comprising both static and
dynamic analysis. AASandbox first performs a static pre-check, followed by
a comprehensive dynamic analysis. In the static analysis, the
application's binary image is disassembled, and the disassembled code is
used for feature extraction and to search for suspicious patterns. After
the static analysis, dynamic analysis is performed: the binary is
installed and executed in the AASandbox, with the "Android Monkey" used to
generate runtime inputs. System calls are logged, and the generated log
files are summarized and condensed into a mathematical vector for better
analysis. In this way, the application is classified as either benign or
malicious.
5.3 Cloud-based Analysis Technique
Mobile devices possess limited battery and computational resources. With
such constrained resources, it is quite problematic to deploy a
full-fledged security mechanism on a smartphone. As data volume increases,
it is more efficient to move security mechanisms to an external server
rather than increase the working load of the mobile device [10], [15].
5.3.1 Process of malware detection
In the cloud-based method of malware detection, all security computations
are moved to the cloud that hosts several replicas of the mobile phones
running on emulators & result is sent back to mobile device. This increases
the performance of mobile devices.
Pros: Cloud holds ample resources of each type that helps in more
comprehensive detection of malware.
Cons: Extra cost to maintain the cloud and to forward data to the cloud
server.
5.3.2 Example
Figure 4 shows Paranoid Android (PA), proposed by Portokalidis et al. [15].
Here, security analysis and computations are moved to a cloud (remote
server). The system consists of two modules, a tracer and a replayer. A
tracer is located in each smartphone; it records all the information
required to replay the execution of the mobile application on the remote
server. The information recorded by the tracer is first filtered and
encoded, then stored, and the synchronized data is sent to the replayer
over an encrypted channel. The replayer is located in the cloud. It holds
the replica of the mobile phone running on an emulator and records the
information communicated by the tracer. The replayer replays the same
execution on the emulator, in the
cloud. The cloud, as a remote server, has abundant resources to perform
multifarious analyses on the data collected from the tracer. During the
replay, numerous security analyses such as dynamic malware analysis,
memory scanners, system call tracing and call graph analysis [15] can be
performed; indeed, there is no limit on the number of attack detection
techniques that can be applied in parallel.
Figure 4. Cloud-based Detection Technique
5.4 Monitoring Battery Consumption
Monitoring battery life is a completely different approach to malware
detection compared to the others. Smartphones usually possess limited
battery capacity, which must be used judiciously. The usual user behavior,
existing battery state, signal strength and network traffic details of a
mobile are recorded over time, and this data can be used to detect hidden
malicious activities. By observing current energy consumption, malicious
applications can be detected, as they are expected to consume more power
than normal usage. However, battery capacity is itself one of the major
limitations of mobile phones, limiting the complexity of on-device
anti-malware solutions. Remarkable work has been done in this field; the
introductory exploration in this domain was done by Jacoby and Davis [16].
5.4.1 Process of malware detection
After infection, a greedy malware keeps replicating itself. If its means
of propagation is Bluetooth, then the device continuously scans for
adjacent Bluetooth-enabled devices, which in turn consumes a remarkable
amount of power. This time-domain power-consumption data, collected over
a period of time, is transformed into the frequency domain and
represented as dominant frequencies. Malware is identified from these
dominant frequencies.
Pros: An economical and novel approach to malware detection.
Cons: Because of the multi-functionality of smartphones, an accurate
power-consumption model of a smartphone is difficult to define.
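The frequency-domain step described above can be sketched with a plain discrete Fourier transform over sampled power readings; the power trace below is synthetic and the sampling parameters are illustrative assumptions.

```python
import cmath
import math

def dominant_frequency(samples, sample_rate):
    """Return the non-DC frequency (Hz) with the largest DFT magnitude."""
    n = len(samples)
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2 + 1):  # skip k = 0, the DC component
        coeff = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * sample_rate / n

# Synthetic trace: a periodic Bluetooth-scan power spike every 4 samples,
# sampled at 1 Hz, riding on a constant baseline draw.
trace = [1.0 + 0.5 * math.cos(2 * math.pi * 0.25 * t) for t in range(64)]
peak_hz = dominant_frequency(trace, 1.0)  # 0.25 Hz
```

A malware signature would then be a characteristic dominant frequency (here 0.25 Hz) that is absent from the device's normal usage profile.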
5.4.2 Example
Recent work by Liu et al. [17] proposed another detection technique based
on comparing compressed sequences of the power-consumption values in each
time interval. They defined a user-centric power model that relies on
user actions, such as the duration and frequency of calls, the number of
SMS messages, and network usage. Their work uses machine learning
techniques to generate rules for malware detection.
5.5 Application Permission Analysis
With the advancements in mobile phone technology, users have started
downloading third-party applications, which are available in third-party
application stores. While developing an application, developers must
request the permissions the application requires in order to work on the
device. Permissions hold a crucial role in mobile application
development, as they convey the intents and back-end activities of the
application to the user. Permissions should be precisely defined and
displayed to the user before the application is installed. However, some
application developers hide certain permissions from the user, making the
application vulnerable or outright malicious.
5.5.1 Process of malware detection
The security configuration of an application is extracted, and the
permissions taken by the application are analyzed. If the application has
requested any unwanted permissions, it is categorized as malicious.
Pros: Fewer resources are required compared to other techniques.
Cons: Analyzing only the permissions request is not adequate for mobile
malware detection; it needs to be done in parallel with static and/or dynamic
analysis.
5.5.2 Example
Kirin, proposed by Enck et al. (2009) [18], is an application
certification system for Android. During installation, Kirin cross-checks
the application permissions. It extracts the security configurations of the application
and checks them against templates, i.e., security policy rules already
defined by Kirin. If an application fails to satisfy all the security
policy rules, Kirin either deletes the application or alerts the user for
assistance [18].
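A Kirin-style certification check can be sketched as a set of rules over the requested permission set; the rules and permission names below are illustrative assumptions, not Kirin's actual security policy.

```python
# Illustrative policy: permission combinations considered dangerous together.
POLICY_RULES = [
    {"RECEIVE_SMS", "SEND_SMS"},               # silent SMS relay
    {"RECORD_AUDIO", "INTERNET"},              # eavesdrop-and-upload
    {"READ_CONTACTS", "INTERNET", "SEND_SMS"}, # contact harvesting
]

def certify(requested):
    """Return the list of violated rules; an empty list means the app passes."""
    requested = set(requested)
    return [rule for rule in POLICY_RULES if rule <= requested]

violations = certify(["INTERNET", "RECORD_AUDIO", "CAMERA"])
# violations == [{"RECORD_AUDIO", "INTERNET"}] -> flag the app or alert the user
```

As the survey notes, permission analysis alone is coarse: a benign voice recorder triggers the same rule, which is why it is normally combined with static or dynamic analysis.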
6. MOBILE MALWARE CONTROL STRATEGIES
Basically, there are two types of malware control strategies, viz.
proactive and reactive control. In the proactive strategy, malware is
mitigated before its propagation, using a proper set of preventive
measures. In the reactive strategy, action is taken only after the
malware has propagated.
6.1 Proactive Malware Control Strategy
Here are some of the proactive malware control techniques given in [10];
however, users’ own security awareness plays a crucial role.
 Install a decent mobile security application i.e. antivirus.
 Always download apps from trusted official application markets.
Before downloading any app, read its reviews and ratings. During
installation, always remember to read the permissions requested by
the app, and if anything appears doubtful, do not install it. Always
keep installed apps up-to-date.
 Turn off Wi-Fi, Bluetooth, and other short-range wireless
communication media when not in use. Stay conscious when connecting
to insecure public Wi-Fi networks and when accepting Bluetooth data
from unknown senders.
 When confidential data is to be stored in the mobile phone, encrypt it
before storing and set a password for access. Do regular back-ups.
Assure that the sensitive information is not cached locally in the
mobile phone.
 Always keep an eye on battery life and on SMS and call charges; if
any unusual behavior is found, perform an in-depth check on the
recently installed applications.
 During internet access, don’t click on links that seem suspicious or
not trustworthy.
 Finally, in case of mobile phone theft, delete all contacts,
applications, and confidential data remotely.
6.2 Reactive Malware Control Strategy
The working principle of the reactive malware control strategy is that
the control measure is implemented once malware is detected. Antivirus
solutions fall under proactive malware control; however, when a new
malware is found, implementing antivirus updates for that malware and
forwarding them to mobile phones is a part of reactive malware control.
This is known as adaptive patch dissemination.
Adaptive Patch Dissemination
Pre-immunization, such as installed antivirus software, protects networks
before a virus propagates. In reality, however, viruses are first
detected and then antivirus updates, known as patches, are created. These
patches are forwarded into networks only after the viruses have already
propagated. Network bandwidth limits the speed with which security
notifications or patches can be sent to all users simultaneously.
Therefore, a new strategy, namely the adaptive dissemination strategy,
was developed. It is based on the Autonomy Oriented Computing (AOC)
methodology, which helps send security notifications or patches to most
phones at a relatively low communication cost. AOC is used to search for
a set of highly connected phones with large communication abilities in a
mobile network [5].
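The AOC search itself is beyond a short sketch, but its goal of finding a small set of highly connected phones to seed patches can be approximated with a simple degree-based selection over a contact graph; this is an illustration of the idea, not the AOC algorithm.

```python
def top_connected(contacts, k):
    """Pick the k phones with the most contacts to receive patches first.

    contacts: dict mapping phone id -> set of neighbor phone ids.
    """
    return sorted(contacts, key=lambda p: len(contacts[p]), reverse=True)[:k]

graph = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A", "D"},
    "D": {"A", "C", "E"},
    "E": {"D"},
}
seeds = top_connected(graph, 2)  # "A" and "D", each with 3 contacts
```

Patching the highly connected phones first lets the update reach most of the network quickly while keeping the number of direct transmissions from the server low.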
7. CONCLUSION
Rapid growth in smartphone development has resulted in the evolution of
mobile malware. Operating system market share plays a crucial role in
malware evolution. SMS/MMS is the fastest means of mobile malware
propagation, as it has no geographical boundary like Bluetooth/Wi-Fi;
FM-RDS-based propagation is still evolving. Among the malware detection
techniques, static analysis is performed first, during pre-checks;
dynamic analysis is performed later and can be combined with application
permission analysis. Cloud-based analysis is the most comprehensive
approach, as it uses external resources to perform malware detection and
can run more than one type of analysis simultaneously. The proactive
control strategy is used to control malware before it propagates, while
the reactive strategy is used after malware has propagated.
REFERENCES
[1] La Polla, M., Martinelli, F., & Sgandurra, D. (2012). A survey on security for mobile
devices. IEEE Communications Surveys & Tutorials, 15(1), 446 – 471.
[2] Kaspersky Lab IT Threat Evolution: Q2 2013. (2013). Retrieved from
http://www.kaspersky.co.in/about/news/virus/2013/kaspersky_lab_it_threat_evolution_q2_
2013.
[3] Kaspersky Security Bulletin 2013: Overall statistics for 2013. (2013 December).
Retrieved from
http://www.securelist.com/en/analysis/204792318/Kaspersky_Security_Bulletin_2013_Ove
rall_statistics_for_2013.
[4] Maslennikov, D. Mobile Malware Evolution: Part 6. (2013 February). Retrieved from
http://www.securelist.com/en/analysis/204792283/Mobile_Malware_Evolution_Part_6.
[5] Gao, C., and Liu, J. (2013). Modeling and restraining mobile virus propagation. IEEE
transactions on mobile computing, 12(3), 529-541.
[6] Gao, C. and Liu, J. (2011). Network immunization and virus propagation in Email
networks: Experimental evaluation and analysis. Knowledge and information systems,
27(2), 253-279.
[7] Yan, G., and Eidenbenz, S. (2009, March). Modeling propagation dynamics of
Bluetooth worms (extended version). IEEE transactions on Mobile Computing, 8(3), 353-
368.
[8] Gonzalez, M., Hidalgo, C., and Barabasi, A. (2008). Understanding individual human
mobility patterns. Nature, 453(7196), 779-782.
[9] Fernandes, E., Crispo, B., Conti, M. (2013, June). FM 99.9, Radio virus: Exploiting
FM radio broadcasts for malware deployment. IEEE Transactions on Information Forensics
and Security, 8(6), 1027-1037.
[10] Chandramohan, M., and Tan, H. (2012). Detection of mobile malware in the wild.
IEEE computer society, 45(9), 65-71.
[11] Yan, Q., Li, Y., Li, T., and Deng, R. (2009). Insights into malware detection and
prevention on mobile phones. Springer-Verlag Berlin Heidelberg, SecTech 2009, 242–249.
[12] Enck, W., Octeau, D., Mcdaniel, P., and Chaudhuri, S. (2011 August). A study of
android application security. The 20th Usenix security symposium.
[13] Egele, M., Scholte, T., Kirda, E., Kruegel, C. (2012 February). A survey on automated
dynamic malware-analysis techniques and tools. ACM Computing Surveys, 44(2), Article 6.
[14] Blasing, T., Batyuk, L., Schmidt, A., Camtepe, S., and Albayrak, S. (2010). An
android application sandbox system for suspicious software detection. 5th International
Conference on Malicious and Unwanted Software.
[15] Portokalidis, G., Homburg, P., Anagnostakis, K., Bos, H. (2010 December). Paranoid
android: Versatile protection for smartphones. ACSAC'10.
[16] Jacoby, G. (2004). Battery-based intrusion detection. The Global Telecommunications
Conference.
[17] Liu, L., Yan, G., Zhang, X., and Chen, S. (2009). Virusmeter: Preventing your
cellphone from spies. RAID, 5758, 244-264.
[18] Enck, W., Ongtang, M., and Mcdaniel, P. (2009 November). On lightweight mobile
phone application certification. 16th ACM Conference on Computer and Communications
Security.
This paper may be cited as:
Mohite, S. and Sonar, R. S., 2014. A Survey on Mobile Malware: A War
without End. International Journal of Computer Science and Business
Informatics, Vol. 9, No. 1, pp. 23-35.
An Efficient Design Tool to Detect
Inconsistencies in UML Design Models
Mythili Thirugnanam
Assistant Professor (Senior)
School of Computing Science and Engineering
VIT University,Vellore, Tamil Nadu
Sumathy Subramaniam
Assistant Professor (SG)
School of Information Technology and Engineering
VIT University, Vellore, Tamil Nadu
ABSTRACT
Quality of any software developed is evaluated based on the design aspect. Design is one of
the most important phases in software life cycle. Poor process design leads to high failure
rate of the software. To design the software, various traditional and UML models are
widely used. There are many tools proposed and are available to design the UML models as
per the user requirements. However, these tools do not support validation
of UML models, which ultimately leads to design errors. Most of the
existing testing tools check the consistency of UML models; some check for
inconsistencies, i.e., violations of the consistency rules required for
UML models. The proposed work aims to develop an efficient tool that
detects inconsistencies in a given UML model. Parsing techniques are
applied to extract the XML tags. The extracted tags contain relevant
details such as class names, attribute names, operation names and
associations, with their corresponding names in the class diagram, in the
meta-model format. On applying the consistency rules to the given input
UML model, inconsistencies are detected and a report is generated. From
the inconsistency report, the error efficiency and design efficiency are
computed.
Keywords
Software Design, Unified Modeling Language (UML), Testing, Extensible Markup
Language (XML).
1. INTRODUCTION
In the present-day scenario, software programming is moving towards
high-level design, which raises new research issues and scope for
developing new sets of tools that support design specification. Most
research in software specification uses verification and validation
techniques to prove correctness in terms of certain properties. The
delivery of a high-quality software product is a major goal of software
engineering; an important aspect is achieving an error-free software
product that assures the quality of the software. Inspection and testing
are common verification and validation (V&V) approaches for defect
detection in the software development process. Existing statistical data
show that the cost of finding and repairing software bugs rises
drastically in later development stages. The Unified
Modeling Language (UML) is now widely accepted as the standard modeling
language for software construction. The class diagram, in its core view,
provides the backbone for any modeling effort and has well-formed
semantics.
2. BACKGROUND STUDY
Alexander Egyed [4, 5] presents an automated approach for detecting and
tracking inconsistencies in real time and to automatically identify changes in
various models that affect the consistency rules. The approach observes the
behavior of consistency rules to understand how they affect the model.
Techniques for efficiently detecting inconsistencies in UML models and
identifying the changes required to fix problems are analyzed. The work
describes a technique for automatically generating a set of concrete
changes for fixing inconsistencies and providing information about the
impact of each change on all consistency rules. The approach is integrated
with the design tool IBM Rational Rose. Muhammad Usman [9] presents a
survey of UML consistency checking techniques by analyzing various
parameters and constructs an analysis table. The analysis table helps
evaluate existing consistency checking techniques and concludes that most
of the approaches validate intra and inter level consistencies between UML
models by using monitoring strategy. UML class, sequence, and state chart
diagrams are used in most of the existing consistency checking techniques.
Alexander Egyed demonstrates [3] that a tool can assist the designer in
discovering unintentional side effects, locating choices for fixing
inconsistencies, and then in changing the design model.
The paper examines the impact of changes on UML design models [10] and
explores the methodology to discover the negative side effects of design
changes, and to predict the positive and negative impact of these choices.
Alexander Egyed [1, 2] presents an approach for quickly, correctly, and
automatically deciding the consistency rules required to evaluate when a
model changes. The approach does not require consistency rules with
special annotations. Instead, it treats consistency rules as black-box entities
and observes their behavior during their evaluation to identify the different
types of model elements they access.
Christian Nentwich [6, 7] presents a repair framework for inconsistent
distributed documents for generating interactive repairs from full first order
logic formulae that constrain the documents. A full implementation of the
components as well as their application to the UML and related
heterogeneous documents such as EJB deployment descriptors are
presented. This approach can be used as an infrastructure for building high
domain specific frameworks. Researchers have focused to remove
inconsistencies in a few UML models. The work proposed in [11] attempts
to address and detect inconsistencies in UML models such as class
diagrams, use case diagrams and sequence diagrams. A survey exploring the
impact of model-driven software development is given in [12]; change
impact analysis, consistency management, uncertainty management, and
inconsistency detection and resolution rules are dealt with in that work.
3. FRAME WORK OF THE PROPOSED WORK
Figure 1. Framework of the proposed work
4. DETAILED DESCRIPTION OF THE PROPOSED WORK
The framework of the proposed work is given in Figure 1.
4.1. Converting UML model into XML file
A UML design diagram by itself does not support direct detection of
inconsistencies. The UML model is therefore converted into an XML file
for detecting inconsistencies in the model. UML models such as use case
diagrams, class diagrams and sequence diagrams can be taken as input for
this tool. The final output of this module is an XML file, which is used
further to detect inconsistencies. The snapshot of reading the input file
is shown in Figure 2.
Extract the XML tags
Apply parsing
Technique
Applying consistency
rules
Detect Inconsistency in the
given input
Generate the
Inconsistency report
Select UML model Convert UML model into
XML file
Procedure used:
 Convert the chosen input design into an XML file
 In the VP-UML project, select the input file and export it as an XML file
 Select the diagram that needs to be exported
 Select the location where the exported file is to be stored
The input file is read from the user to carry out further processing
(Figure 2). Here, a use case diagram is read as the input file. The input
diagram is stored as an XML file and passed as input to the next process,
which extracts the XML tags.
4.2. Extracting the XML tags and applying the parsing technique
From the XML file, the XML tags are extracted. The parsing technique is
applied to the XML tags to identify the related information of the given
model, which is in the meta-model format [3]. For example, in a class
diagram, the class name, its attributes and its methods are identified.
All the related information of the given input model is extracted.
Procedure used:
 Open the XML file
 Copy the file as a text file
 Split the tags into tokens
 Extract the relevant information about the diagram
 Save the extracted result in a file.
Figures 3 and 4 describe the above procedure. The XML file is the input
for this step. This method adopts the tokenizer concept to split the tags
and store them.
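The tag-extraction step can be sketched with Python's ElementTree; the element and attribute names below are assumptions for illustration, since the exact schema depends on the export tool (VP-UML), not on this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical export format; real VP-UML output uses its own schema.
XML = """
<Model>
  <Class name="Account">
    <Attribute name="balance" visibility="private"/>
    <Operation name="deposit" visibility="public"/>
  </Class>
</Model>
"""

def extract_classes(xml_text):
    """Return {class name: {'attributes': [...], 'operations': [...]}}."""
    root = ET.fromstring(xml_text)
    result = {}
    for cls in root.iter("Class"):
        result[cls.get("name")] = {
            "attributes": [a.get("name") for a in cls.iter("Attribute")],
            "operations": [o.get("name") for o in cls.iter("Operation")],
        }
    return result

info = extract_classes(XML)
# info == {"Account": {"attributes": ["balance"], "operations": ["deposit"]}}
```

The resulting dictionary is the meta-model-style summary against which the consistency rules of the next step can be checked.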
4.3. Detecting the design inconsistency
The consistency rules [8, 10] are applied to the related information of
the given input design diagram to detect inconsistencies. Related
information that does not satisfy a rule indicates a design inconsistency
in the given input model. All possible inconsistencies are detected as
described below. Figure 5 shows the inconsistencies in the given use case
diagram.
4.3.1. Consistency rule for the Class Diagram:
 Visibility of a member should be given.
 Visibility of all attributes should be private.
 Visibility of all methods should be public.
 Associations should have cardinality relationship.
 When one class depends on another class, there should be class
interfaces notation.
4.3.2. Consistency rule for the Use Case Diagram
 Every actor has at least one relationship with a use case.
 The system boundary should be defined.
 All words that suggest incompleteness, such as 'some' and 'etc.',
should be removed.
4.3.3. Consistency rule for the Sequence Diagram
 All objects should have at least one interaction with any other object
 For each message proper parameters should be included
Procedure used:
 Select the input design model
 Based on the chosen design model (class diagram, use case diagram or
sequence diagram), the extracted result is compared against the given
consistency rules and inconsistencies are detected.
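The class-diagram rules listed above can be encoded directly; the member representation used here (kind, name, visibility tuples) is an assumption for illustration, not the tool's internal format.

```python
def check_class(members):
    """Apply the visibility rules to a list of (kind, name, visibility) members."""
    issues = []
    for kind, name, visibility in members:
        if visibility is None:
            issues.append(f"{name}: visibility of a member should be given")
        elif kind == "attribute" and visibility != "private":
            issues.append(f"{name}: visibility of all attributes should be private")
        elif kind == "method" and visibility != "public":
            issues.append(f"{name}: visibility of all methods should be public")
    return issues

members = [
    ("attribute", "balance", "public"),  # violates the attribute rule
    ("method", "deposit", "public"),     # consistent
    ("method", "audit", None),           # visibility missing
]
report = check_class(members)  # two issues: "balance" and "audit"
```

Collecting the rule violations into a list mirrors the inconsistency report described in the next step.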
4.4. Generating the inconsistency report
A collective report is generated for all the inconsistencies detected in
the given input model, providing an overall view of the model's
inconsistency.
4.5. Computing Design Efficiency
The total number of possible errors in the design model is estimated [10].
Then the total number of errors found in the input design model is
determined with the procedures discussed. The error efficiency is computed
using equation 1. From the calculated error efficiency of the design, the
design efficiency is computed using equation 2. The implementation of the
same is shown in Figure 6.
[eq 1]
[eq 2]
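The two equations referenced above were lost in conversion; a plausible reconstruction, assuming the standard percentage definitions implied by the surrounding text (an assumption, not the authors' exact formulas), is:

```latex
\text{Error efficiency} = \frac{\text{Number of errors found}}{\text{Total number of possible errors}} \times 100\% \qquad (1)

\text{Design efficiency} = 100\% - \text{Error efficiency} \qquad (2)
```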
5. RESULTS & DISCUSSION
In the recent past there has been rapid development of new approaches in
software design and testing. The proposed system primarily aims to detect
inconsistencies, thereby providing an efficient design specification.
Though much research is ongoing on detecting inconsistencies in various
UML models, not much work has been carried out on use case and class
diagrams. The developed system does not have
any constraint on the maximum number of lines of code. This added feature
makes this tool more versatile when compared with the existing tools.
Various design models for different scenarios were taken as samples and
tested for consistency. The results obtained proved that the developed tool
was able to detect all the inconsistencies available in the given input model.
Figure 2. Selecting input model (UML model is the chosen Use Case Design)
Figure 3. Snapshot shows the XML format file extracted from the input UML Model
Figure 4. Snapshot shows relevant information obtained from the given design from the XML file
Figure 5. Snapshot shows inconsistency details for the given input design
Figure 6. Snapshot shows the efficiency of the given input design model
6. CONCLUSION AND FUTURE ENHANCEMENT
Inspection and testing of software are important approaches in software
engineering practice that aim to reduce the number of defects in software
products. Software inspection focuses on design specifications
in early phases of software development whereas traditional testing
approaches focus on implementation phases or later. Software inspection is
widely regarded as an effective defect finding technique. Recent research
has considered the application of tool support as a means to increase its
efficiency. During design model construction and validation, a variety of
faults can be found. Testing at an early phase of the software life cycle not
only increases quality but also reduces the cost incurred. The developed tool
can help to enforce the inspection process and provide support for finding
defects in the design model, and also compute the design efficiency on
deriving the error efficiency. This work would take care of the major
constraints imposed while creating design models such as class diagram, use
case diagram and sequence diagram. Further enhancement of the proposed
work is to address the other major constraints in class diagrams such as
inheritance, association, cardinality constraints and so on.
REFERENCES
[1] A. Egyed and D. S. Wile, Support for Managing Design-Time Decisions, IEEE
Transactions on Software Engineering, 2006.
[2] A.Egyed, Fixing Inconsistencies in UML Design Models, ICSE, 2007.
[3] A.Egyed, Instant Consistency Checking for UML, Proceedings of the International
Conference on Software Engineering, 2006.
[4] A.Egyed, E.Letier, A.Finkelstein, Generating and Evaluating Choices for Fixing
Inconsistencies in UML Design Models, International Conference on Software
Engineering, 2008.
[5] A Egyed, Automatically Detecting and Tracking Inconsistencies in Software Design
Models IEEE Transactions on Software Engineering, ISSN: 0098-5589, 2009.
[6] C.Nentwich, I.Capra and A.Finkelstein, xlinkit: a consistency checking and smart link
generation service, ACM transactions on Internet Technology, 2002.
[7] C.Nentwich, W. Emmerich and A.Finkelstein, Consistency Management with Repair
Actions, ICSE, 2003.
[8] Diana Kalibatiene, Olegas Vasilecas, Ruta Dubauskaite, Ensuring Consistency in
Different IS Models – UML Case Study, Baltic J. Modern Computing, Vol. 1, No. 1-2,
pp. 63-76, 2013.
[9] Muhammad Usman, Aamer Nadeem, Tai-hoon Kim, Eun-suk Cho, A Survey of
Consistency Checking Techniques for UML Models , Advanced Software Engineering
& Its Applications,2008.
[10]R. Dubauskaite, O.Vasilecas, Method on specifying consistency rules among different
aspect models, expressed in UML, Elektronika ir elekrotechnika , ISSN 1392 -1215.
Vol.19, No.3, 2013.
[11]Rumbaugh, J., Jacobson, I., Booch, G., The Unified Modeling Language Reference
Manual. Addison-Wesley, 1999.
[12] Amal Khalil and Juergen Dingel, Supporting the evolution of UML models in model
driven software development: A Survey, Technical Report, School of Computing,
Queen’s University, Canada, Feb 2013.
This paper may be cited as:
Thirugnanam, M. and Subramaniam, S., 2014. An Efficient Design Tool to
Detect Inconsistencies in UML Design Models. International Journal of
Computer Science and Business Informatics, Vol. 9, No. 1, pp. 36-44.
An Integrated Procedure for Resolving
Portfolio Optimization Problems using
Data Envelopment Analysis, Ant
Colony Optimization and Gene
Expression Programming
Chih-Ming Hsu
Minghsin University of Science and Technology
1 Hsin-Hsing Road, Hsin-Fong, Hsinchu 304, Taiwan, ROC
ABSTRACT
The portfolio optimization problem is an important issue in the field of investment/financial
decision-making and is currently receiving considerable attention from both researchers and
practitioners. In this study, an integrated procedure using data envelopment analysis (DEA),
ant colony optimization (ACO) for continuous domains and gene expression programming
(GEP) is proposed. The procedure is evaluated through a case study on investing in stocks
in the semiconductor sub-section of the Taiwan stock market. The potential average six-
month return on investment of 13.12% from November 1, 2007 to July 8, 2011 indicates
that the proposed procedure can be considered a feasible and effective tool for making
outstanding investment plans. Moreover, it is a strategy that can help investors make profits
even though the overall stock market suffers a loss. The present study can help an investor
to screen stocks with the most profitable potential rapidly and can automatically determine
the optimal investment proportion of each stock to minimize the investment risk while
satisfying the target return on investment set by the investor.
Furthermore, this study addresses the scarcity of discussion in the
literature about the timing of buying/selling stocks by providing a set
of transaction rules.
Keywords
Portfolio optimization, Data envelopment analysis, Ant colony optimization, Gene
expression programming.
1. INTRODUCTION
Portfolio optimization is a procedure that aims to find the optimal
percentage asset allocation for a finite set of assets, thus giving the highest
return for the least risk. It is an important issue in the field of
investment/financial decision-making and currently receiving considerable
attention from both researchers and practitioners. The first parametric model
applied to the portfolio optimization problem was proposed by Harry M.
Markowitz [1]. This is the Markowitz mean-variance model, which is the
foundation for modern portfolio theory. The non-negativity constraint
makes the standard Markowitz model NP-hard and inhibits an analytic
solution. Although quadratic programming can be used to solve the problem
with a reasonably small number of different assets, it becomes much more
difficult if the number of assets is increased or if additional constraints, such
as cardinality constraints, bounding constraints or other real-world
requirements, are introduced.
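The mean-variance objective underlying the Markowitz model can be illustrated with a small sketch of the portfolio return and variance for fixed weights; the asset statistics below are made-up numbers, not market data.

```python
def portfolio_stats(weights, mean_returns, cov):
    """Expected return w'mu and variance w'Sigma w of a portfolio."""
    n = len(weights)
    exp_ret = sum(weights[i] * mean_returns[i] for i in range(n))
    variance = sum(weights[i] * cov[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    return exp_ret, variance

mu = [0.10, 0.05]            # expected asset returns (illustrative)
cov = [[0.04, 0.01],
       [0.01, 0.02]]         # covariance matrix of asset returns
w = [0.6, 0.4]               # non-negative weights summing to 1
ret, var = portfolio_stats(w, mu, cov)  # ret = 0.08, var = 0.0224
```

The optimization problem is then to choose `w` minimizing `var` subject to a target `ret`, the budget constraint, and the non-negativity constraint discussed above; the heuristics surveyed next search this weight space when exact solvers become impractical.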
Therefore, various approaches for tackling portfolio optimization problems
using heuristic techniques have been proposed. For example,
Anagnostopoulos and Mamanis [2] formulated the portfolio selection as a
tri-objective optimization problem that aims to simultaneously maximize the
expected return, as well as minimize risk and the number of assets held in
the portfolio. In addition, their proposed model also considered quantity
constraints and class constraints intended to limit the proportion of the
portfolio invested in assets with common characteristics and to avoid very
small holdings. The experimental results and a comparison revealed that
SPEA2 (strength Pareto evolutionary algorithm 2) [4] is the best algorithm
both for the constrained and unconstrained portfolio optimization problem,
while PESA (Pareto envelope-based selection algorithm) [3] is the runner-
up and the fastest approach of all models compared. Deng and Lin [5]
proposed an approach for resolving the cardinality constrained Markowitz
mean-variance portfolio optimization problem based on the ant colony
optimization (ACO) algorithm. Their proposed method was demonstrated
using test data from the Hang Seng 31, DAX 100, FTSE 100, S&P 100, and
Nikkei 225 indices from March 1992 to September 1997, which yielded
adequate results. Chen et al. [6] proposed a decision-making model of
dynamic portfolio optimization for adapting to the change of stock prices
based on time adapting genetic network programming (TA-GNP) to
generate portfolio investment advice. Their model determines the
distribution of initial capital to each brand in the portfolio and creates
trading rules for buying and selling stocks on a regular basis, using
technical indices and candlestick charts as judgment functions. The
effectiveness and
efficiency of their proposed method was demonstrated by an experiment on
the Japanese stock market. The comparative results showed that TA-GNP
generates more profit than the traditional static GNP, genetic algorithms
(GAs), and the buy-and-hold strategy. Sun et al. [7] modified the
update equations of velocity and position of the particle in particle swarm
optimization (PSO) and proposed the drift particle swarm optimization
(DPSO) to resolve the multi-stage portfolio optimization (MSPO) problem
where transactions take place at discrete time points during the planning
horizon. The authors illustrated their approach by conducting experiments
on the problem with different numbers of stages in the planning horizon
using sample data collected from the S&P 100 index. The experimental
results and a comparison indicated that the DPSO heuristic can yield
superior efficient frontiers compared to PSO, GAs and two classical


Vol 9 No 1 - January 2014

  • 1. ISSN: 1694-2507 (Print) ISSN: 1694-2108 (Online) International Journal of Computer Science and Business Informatics (IJCSBI.ORG) VOL 9, NO 1 JANUARY 2014
  • 2. Table of Contents VOL 9, NO 1 JANUARY 2014 A Predictive Stock Data Analysis with SVM-PCA Model .......................................................................1 Divya Joseph and Vinai George Biju HOV-kNN: A New Algorithm to Nearest Neighbor Search in Dynamic Space.......................................... 12 Mohammad Reza Abbasifard, Hassan Naderi and Mohadese Mirjalili A Survey on Mobile Malware: A War without End................................................................................... 23 Sonal Mohite and Prof. R. S. Sonar An Efficient Design Tool to Detect Inconsistencies in UML Design Models............................................. 36 Mythili Thirugnanam and Sumathy Subramaniam An Integrated Procedure for Resolving Portfolio Optimization Problems using Data Envelopment Analysis, Ant Colony Optimization and Gene Expression Programming ................................................. 45 Chih-Ming Hsu Emerging Technologies: LTE vs. WiMAX ................................................................................................... 66 Mohammad Arifin Rahman Khan and Md. Sadiq Iqbal Introducing E-Maintenance 2.0 ................................................................................................................. 80 Abdessamad Mouzoune and Saoudi Taibi Detection of Clones in Digital Images........................................................................................................ 91 Minati Mishra and Flt. Lt. Dr. M. C. Adhikary The Significance of Genetic Algorithms in Search, Evolution, Optimization and Hybridization: A Short Review ...................................................................................................................................................... 103 IJCSBI.ORG
  • 4. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 9, No. 1. JANUARY 2014 1 A Predictive Stock Data Analysis with SVM-PCA Model Divya Joseph PG Scholar, Department of Computer Science and Engineering Christ University Faculty of Engineering Christ University, Kanmanike, Mysore Road, Bangalore - 560060 Vinai George Biju Asst. Professor, Department of Computer Science and Engineering Christ University Faculty of Engineering Christ University, Kanmanike, Mysore Road, Bangalore – 560060 ABSTRACT In this paper the properties of Support Vector Machines (SVM) on the financial time series data has been analyzed. The high dimensional stock data consists of many features or attributes. Most of the attributes of features are uninformative for classification. Detecting trends of stock market data is a difficult task as they have complex, nonlinear, dynamic and chaotic behaviour. To improve the forecasting of stock data performance different models can be combined to increase the capture of different data patterns. The performance of the model can be improved by using only the informative attributes for prediction. The uninformative attributes are removed to increase the efficiency of the model. The uninformative attributes from the stock data are eliminated using the dimensionality reduction technique: Principal Component Analysis (PCA). The classification accuracy of the stock data is compared when all the attributes of stock data are being considered that is, SVM without PCA and the SVM-PCA model which consists of informative attributes. Keywords Machine Learning, stock analysis, prediction, support vector machines, principal component analysis. 1. 
INTRODUCTION Time series analysis and prediction is an important task in all fields of science for applications like forecasting the weather, forecasting the electricity demand, research in medical sciences, financial forecasting, process monitoring and process control, etc [1][2][3]. Machine learning techniques are widely used for solving pattern prediction problems. The financial time series stock prediction is considered to be a very challenging task for analysts, investigator and economists [4]. A vast number of studies in the past have used artificial neural networks (ANN) and genetic algorithms for the time series data [5]. Many real time applications are using the ANN tool for time-series modelling and forecasting [6]. Furthermore the
  • 5. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 9, No. 1. JANUARY 2014 2 researchers hybridized the artificial intelligence techniques. Kohara et al. [7] incorporated prior knowledge to improve the performance of stock market prediction. Tsaih et al. [8] integrated the rule-based technique and ANN to predict the direction of the S& P 500 stock index futures on a daily basis. Some of these studies, however, showed that ANN had some limitations in learning the patterns because stock market data has tremendous noise and complex dimensionality [9]. ANN often exhibits inconsistent and unpredictable performance on noisy data [10]. However, back-propagation (BP) neural network, the most popular neural network model, suffers from difficulty in selecting a large number of controlling parameters which include relevant input variables, hidden layer size, learning rate, and momentum term [11]. This paper proceeds as follows. In the next section, the concepts of support vector machines. Section 3 describes the principal component analysis. Section 4 describes the implementation and model used for the prediction of stock price index. Section 5 provides the results of the models. Section 6 presents the conclusion. 2. SUPPORT VECTOR MACHINES Support vector machines (SVMs) are very popular linear discrimination methods that build on a simple yet powerful idea [12]. Samples are mapped from the original input space into a high-dimensional feature space, in which a „best‟ separating hyperplane can be found. A separating hyperplane H is best if its margin is largest [13]. The margin is defined as the largest distance between two hyperplanes parallel to H on both sides that do not contain sample points between them (we will see later a refinement to this definition) [12]. 
It follows from the risk minimization principle (an assessment of the expected loss or error, i.e., the misclassification of samples) that the generalization error of the classifier is better if the margin is larger. The separating hyperplane that are the closest points for different classes at maximum distance from it is preferred, as the two groups of samples are separated from each other by a largest margin, and thus least sensitive to minor errors in the hyperplane‟s direction [14].
  • 6. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 9, No. 1. JANUARY 2014 3 2.1 Linearly Separable Data Consider that there exist two classes and uses two labels -1 and +1 for two classes. The sample is { , }t t x r  where rt = +1 if xt ϵ C1 and rt = -1 if xt ϵ C2. To find w and w0 such that where,  represents set of n points xt represents p dimensional real vector rt represents the class (i.e. +1 or -1) 0 1 for r 1T t t w x w     0 1 for r 1T t t w x w     Which can be rewritten as: 0( ) 1t T t r w x w   (1) Here the instances are required to be on the right of the hyperplane and what them to be a distance away for better generalization. The distance from the hyperplane to the instances closest to it on either side is called the margin, which we want to maximize for best generalization. The optimal separating hyperplane is the one that maximizes the margin. The following equation represents the offset of hyperplane from the origin along the normal w. 0| | || || T t w x w w  which, when rt ϵ {+1,-1}, can be written as 0( ) || || t T t r w x w w  Consider this to be some value ρ: 0( ) , t || || t T t r w x w w     (2)
  • 7. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 9, No. 1. JANUARY 2014 4 In order to maximize ρ but there are an infinite number of solutions that are obtained by scaling w, therefore consider ρ ||w|| = 1. Thus to maximize the margin ||w|| is minimized. 2 0 1 min || || subject to r ( ) 1, 2 t T t w w x w t    (3) Figure 1 The geometry of the margin consists of the canonical hyperplanes H1 and H2. The margin is the distance between the separating (g(x) =0) and a hyperplane through the closest points (marked by a ring around the data points). The round rings are termed as support vectors. This is a standard optimization problem, whose complexity depends on d, and it can be solved directly to find w and w0. Then, on both sides of the hyperplane, there will be instances that are 1 || ||w . As there will be two margins along the sides of the hyperplane we sum it up to 2 || ||w . If the problem is not linearly separable instead of fitting a nonlinear function, one trick is to map the problem to a new space by using nonlinear basis function. Generally the new spaces has many more dimensions than the original space, and in such a case, the most interesting part is the method whose complexity does not depend on the input dimensionality. To obtain a new formulation, the Eq. (3) is written as an unconstrained problem using Lagrange multipliers αt :
  • 8. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 9, No. 1. JANUARY 2014 5 2 0 1 2 0 1 1 1 || || [ ( ) 1] 2 1 = || || ( ) + 2 N t t T t p t t t T t t t t L w r w x w w r w x w                This can be minimized with respect to w, w0 and maximized with respect to αt ≥ 0. The saddle point gives the solution. This is a convex quadratic optimization problem because the main term is convex and the linear constraints are also convex. Therefore, the dual problem is solved equivalently by making use of the Karush-Kuhn-Tucker conditions. The dual is to maximize Lp with respect to w and w0 are 0 and also that αt ≥ 0. 1 0 w = n p t t t i L r x w        (5) 10 0 w = = 0 n p t t i L r w        (6) Substituting Eq. (5) and Eq. (6) in Eq. (4), the following is obtained: 0 1 ( ) 2 T T t t t t t t d t t t L w w w r x w r        1 = - ( ) 2 t s t s t T s t t s t r x x x    (7) which can be minimized with respect to αt only, subject to the constraints 0, and 0, tt t t t r    This can be solved using the quadratic optimization methods. The size of the dual depends on N, sample size, and not on d, the input dimensionality.
  • 9. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 9, No. 1. JANUARY 2014 6 Once αt is solved only a small percentage have αt > 0 as most of them vanish with αt = 0. The set of xt whose xt > 0 are the support vectors, then w is written as weighted sum of these training instances that are selected as support vectors. These are the xt that satisfy and lie on the margin. This can be used to calculate w0 from any support vector as 0 t T t w r w x  (8) For numerical stability it is advised that this be done for all support vectors and average be taken. The discriminant thus found is called support vector machine (SVM) [1]. 3. PRINCIPAL COMPONENT ANALYSIS Principal Component Analysis (PCA) is a powerful tool for dimensionality reduction. The advantage of PCA is that if the data patterns are understood then the data is compressed by reducing the number of dimensions. The information loss is considerably less. Figure 2 Diagrammatic Representation of Principal Component Analysis (PCA)
4. CASE STUDY
An investor in stocks ideally wants maximum returns on the investment made and, for that, needs to know which stocks will do well in the future. This is the basic incentive for forecasting stock prices. To do so, the investor has to study different stocks: their price history, their performance, the reputation of the company, and so on, which makes this a broad area of study. Considerable evidence exists showing that stock returns are to some extent predictable. Most of the research is conducted using data from well established stock markets such as the US, Western Europe, and Japan. It is, thus, of interest to study the extent of stock market predictability using data from less well established stock markets such as that of India. Analysts monitor changes in these numbers to decide their trading. As long as past stock prices and trading volumes are not fully discounted by the market, technical analysis has its value for forecasting. To maximize profits from the stock market, more and more "best" forecasting techniques are used by different traders. The research dataset used in this study is from the State Bank of India (SBI). The series spans 10th January 2012 to 18th September 2013. The first training and testing dataset consists of 30 attributes. The second training and testing dataset consists of 5 attributes selected with the dimensionality reduction technique (PCA) in the Weka tool.

Table 1. Number of instances in the case study (State Bank of India Stock Index)
  Total Number of Instances: 400
  Training Instances: 300
  Testing Instances: 100

The purpose of this study is to predict the direction of daily change of the SBI Index. Direction is a categorical variable indicating the movement direction of the SBI Index at any time t, categorized as "0" or "1" in the research data.
"0" means that the next day's index is lower than today's index, and "1" means that the next day's index is higher than today's index. The stock data classification is implemented with Weka 3.7.9. k-fold cross validation is used for the classification. In k-fold cross-validation, the original sample is randomly partitioned into k subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k - 1 subsamples are used as
training data [15]. The cross-validation variable k is set to 10 for the stock dataset [16]. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate.

Figure 3. Weka screenshot of PCA

First, the model is trained with SVM and the results on the test data are saved. Second, the dimensionality reduction technique, PCA, is applied to the training dataset. PCA selects the attributes that carry the most information for the stock index classification; the number of attributes is thus reduced from 30 to 5, and only the most informative attributes are considered for classification. A new SVM model is trained with the reduced attributes. The test data with reduced attributes is provided to this model and the result is saved. The results of both models are compared and analysed.

5. EXPERIMENTAL RESULTS
5.1 Classification without using PCA
As the tables below show, 300 stock index instances were considered as training data and 100 stock index instances as test data. On the test data, 43% of instances were correctly classified and 57% were incorrectly classified.
Table 2. Number of instances for classification without using PCA
  Number of Train Instances: 300
  Number of Test Instances: 100
  Number of Attributes: 30

Table 3. Classification accuracy without using PCA
  Correctly Classified Instances: 43%
  Incorrectly Classified Instances: 57%

5.2 Classification with PCA
As the tables below show, 300 stock index instances were considered as training data and 100 stock index instances as test data. On the test data, 59% of instances were correctly classified and 41% were incorrectly classified.

Table 4. Number of instances for classification using PCA
  Number of Train Instances: 300
  Number of Test Instances: 100
  Number of Attributes: 5

Table 5. Classification accuracy using PCA
  Correctly Classified Instances: 59%
  Incorrectly Classified Instances: 41%

6. CONCLUSION
Support Vector Machines can produce accurate and robust classification results on a sound theoretical basis, even when the input stock data are non-monotone and not linearly separable, and they evaluate the more relevant information in a convenient way. Principal component analysis is an efficient dimensionality reduction method which gives better SVM classification on the stock data. The SVM-PCA model analyzes the stock data with fewer and the most relevant
features. In this way a better picture of the stock data is obtained, which in turn gives more efficient knowledge extraction on the stock indices. The stock data classified better with the SVM-PCA model than with SVM alone. The SVM-PCA model also reduces the computational cost drastically. The instances are labelled with nominal values in the current case study. A future enhancement of this paper would be to use numerical values for labelling instead of nominal values.

7. ACKNOWLEDGMENTS
We express our sincere gratitude to the Computer Science and Engineering Department of Christ University Faculty of Engineering, especially Prof. K Balachandran, for his constant motivation and support.

REFERENCES
[1] Divya Joseph, Vinai George Biju, "A Review of Classifying High Dimensional Data to Small Subspaces", Proceedings of International Conference on Business Intelligence at IIM Bangalore, 2013.
[2] Claudio V. Ribeiro, Ronaldo R. Goldschmidt, Ricardo Choren, "A Reuse-based Environment to Build Ensembles for Time Series Forecasting", Journal of Software, Vol. 7, No. 11, pages 2450-2459, 2012.
[3] Dr. A. Chitra, S. Uma, "An Ensemble Model of Multiple Classifiers for Time Series Prediction", International Journal of Computer Theory and Engineering, Vol. 2, No. 3, pages 454-458, 2010.
[4] Sundaresh Ramnath, Steve Rock, Philip Shane, "The financial analyst forecasting literature: A taxonomy with suggestions for further research", International Journal of Forecasting 24 (2008) 34-75.
[5] Konstantinos Theofilatos, Spiros Likothanassis, Andreas Karathanasopoulos, "Modeling and Trading the EUR/USD Exchange Rate Using Machine Learning Techniques", ETASR - Engineering, Technology & Applied Science Research, Vol. 2, No. 5, pages 269-272, 2012.
[6] "A simulation study of artificial neural networks for nonlinear time-series forecasting",
G. Peter Zhang, B. Eddy Patuwo, and Michael Y. Hu, Computers & OR 28(4):381-396 (2001).
[7] K. Kohara, T. Ishikawa, Y. Fukuhara, Y. Nakamura, Stock price prediction using prior knowledge and neural networks, Int. J. Intell. Syst. Accounting Finance Manage. 6 (1) (1997) 11-22.
[8] R. Tsaih, Y. Hsu, C.C. Lai, Forecasting S&P 500 stock index futures with a hybrid AI system, Decision Support Syst. 23 (2) (1998) 161-174.
[9] Mahesh Khadka, K. M. George, Nohpill Park, "Performance Analysis of Hybrid Forecasting Model In Stock Market Forecasting", International Journal of Managing Information Technology (IJMIT), Vol. 4, No. 3, August 2012.
[10] Kyoung-jae Kim, "Artificial neural networks with evolutionary instance selection for financial forecasting", Expert Systems with Applications 30, 3 (April 2006), 519-526.
[11] Guoqiang Zhang, B. Eddy Patuwo, Michael Y. Hu, "Forecasting with artificial neural networks: The state of the art", International Journal of Forecasting 14 (1998) 35-62.
[12] K. Kim, I. Han, Genetic algorithms approach to feature discretization in artificial neural networks for the prediction of stock price index, Expert Syst. Appl. 19 (2) (2000) 125-132.
[13] F. Cai and V. Cherkassky, "Generalized SMO algorithm for SVM-based multitask learning", IEEE Trans. Neural Netw. Learn. Syst., Vol. 23, No. 6, pp. 997-1003, 2012.
[14] Corinna Cortes and Vladimir Vapnik, "Support-Vector Networks", Machine Learning, Vol. 20, No. 3, 273-297, 1995.
[15] Shivanee Pandey, Rohit Miri, S. R. Tandan, "Diagnosis And Classification Of Hypothyroid Disease Using Data Mining Techniques", International Journal of Engineering Research & Technology, Volume 2, Issue 6, June 2013.
[16] Hui Shen, William J. Welch and Jacqueline M. Hughes-Oliver, "Efficient, Adaptive Cross-Validation for Tuning and Comparing Models, with Application to Drug Discovery", The Annals of Applied Statistics, Vol. 5, No. 4, 2668-2687, 2011, Institute of Mathematical Statistics.

This paper may be cited as: Joseph, D. and Biju, V. G., 2014. A Predictive Stock Data Analysis with SVM-PCA Model. International Journal of Computer Science and Business Informatics, Vol. 9, No. 1, pp. 1-11.
HOV-kNN: A New Algorithm to Nearest Neighbor Search in Dynamic Space

Mohammad Reza Abbasifard
Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran

Hassan Naderi
Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran

Mohadese Mirjalili
Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran

ABSTRACT
Nearest neighbor search is one of the most important problems in computer science due to its numerous applications. Recently, researchers have turned to the harder problem of finding nearest neighbors in a dynamic space. Unfortunately, in contrast to static space, there are not many works in this new area. In this paper we introduce a new nearest neighbor search algorithm (called HOV-kNN) suitable for dynamic space, because it eliminates the preprocessing step that is widespread in static approaches. The basic idea of our algorithm is to eliminate unnecessary computations in the Higher Order Voronoi Diagram (HOVD) in order to find nearest neighbors efficiently. The proposed algorithm can report the k nearest neighbors with time complexity O(kn log n), in contrast to previous work which was O(k²n log n). In order to show its accuracy, we have implemented this algorithm and evaluated it using an automatically and randomly generated data point set.

Keywords
Nearest neighbor search, dynamic space, Higher Order Voronoi Diagram.

1. INTRODUCTION
Nearest Neighbor Search (NNS) is one of the main problems in computer science, with numerous applications such as pattern recognition, machine learning, information retrieval and spatio-temporal databases [1-6]. Different approaches and algorithms have been proposed for these diverse applications. In a well-known categorization, these approaches and algorithms can be divided into static and dynamic (moving points). The
existing algorithms and approaches can be divided into three categories, based on whether the query points and/or data objects are moving: (i) static kNN queries for static objects, (ii) moving kNN queries for static objects, and (iii) moving kNN queries for moving objects [15]. In the first category, data points as well as query point(s) have stationary positions [4, 5]. Most of these approaches first index the data points by performing a pre-processing operation that constructs a specific data structure. It is usually possible to carry out different search algorithms on a given data structure to find nearest neighbors. Unfortunately, the pre-processing step, index construction, has high complexity and takes more time than the search step. This time can be reasonable when the space is static, because once the data structure is constructed, multiple queries can be answered with it. In other words, the time taken by the pre-processing step is amortized over query execution time. In this case, the searching algorithm has logarithmic time complexity. Therefore, these approaches are useful when high-velocity query execution over a large volume of stationary data is necessary. Some applications, however, need the answer to a query as soon as the data is accessible, and cannot tolerate the pre-processing execution time. For example, in a dynamic space where data points are moving, spending such time to construct a temporary index is illogical. As a result, approaches that work very well in static space may be useless in a dynamic one. In this paper a new method, called HOV-kNN, suitable for finding the k nearest neighbors in a dynamic environment, is presented.
In the k-nearest neighbor search problem, given a set P of points in a d-dimensional Euclidean space R^d (P ⊂ R^d) and a query point q (q ∈ R^d), the problem is to find the k points of P nearest to the given query point q [2, 7]. The proposed algorithm has a good query execution complexity, O(kn log n), without suffering a time-consuming pre-processing phase. This approach is based on the well-known Voronoi diagram (VD) [11]. As an innovation, we have changed the Fortune algorithm [13] in order to create the order-k Voronoi diagrams that will be used for finding the kNN. The organization of this paper is as follows. The next section gives an overview of related work. In Section 3, basic concepts and definitions are presented. In Section 4, our new approach, HOV-kNN, is explained. Our experimental results are discussed in Section 5. We finish the paper with conclusions and future work in Section 6.

2. RELATED WORKS
Recently, many methods have been proposed for the k-nearest neighbor search problem. A naive solution for the NNS problem is using linear search,
a method that computes the distance from the query to every single point in the dataset and returns the k closest points. This approach is guaranteed to find the exact nearest neighbors [6]. However, it can be expensive for massive datasets, so approximate nearest neighbor search algorithms have been presented even for static spaces [2]. One of the main components of any NNS approach is the data structure it employs. Among the different data structures, various trees are the most used, and they can be applied in both static and dynamic spaces. Listing the proposed solutions to kNN in static space is out of the scope of this paper; the interested reader can refer to more comprehensive and detailed discussions of this subject in [4, 5]. Just to name some of the more important structures, we can point to the kd-tree, ball-tree, R-tree, R*-tree, B-tree and X-tree [2-5, 8, 9]. In contrast, there are a number of papers that use graph data structures for nearest neighbor search. For example, Hajebi et al. performed hill-climbing on a kNN graph: they built a nearest neighbor graph in an offline phase, and performed a greedy search on it to find the node closest to the query [6]. However, the focus of this paper is on dynamic space. In contrast to static space, finding nearest neighbors in a dynamic environment is a new topic of research with a relatively limited number of publications. Song and Roussopoulos proposed the Fixed Upper Bound Algorithm, Lazy Search Algorithm, Pre-fetching Search Algorithm and Dual Buffer Search to find the k nearest neighbors for a moving query point in a static space with stationary data points [8]. Güting et al. presented a filter-and-refine approach to the kNN search problem in a space where both data points and query points are moving.
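The linear-search baseline described above fits in a few lines; this is a minimal Python sketch with illustrative names (the O(n log n) cost here comes from sorting; a heap would give O(n) for fixed k):

```python
import math

def linear_knn(points, q, k):
    """Brute-force kNN: compute the distance from query q to every
    point and return the k closest ones. Exact, but expensive for
    massive datasets."""
    dist = lambda p: math.dist(p, q)  # Euclidean distance
    return sorted(points, key=dist)[:k]

pts = [(0, 0), (3, 4), (1, 1), (6, 8), (2, 2)]
print(linear_knn(pts, (0, 0), 3))  # [(0, 0), (1, 1), (2, 2)]
```

This exhaustive scan is also a convenient correctness oracle: any faster kNN structure can be validated against it on the same input, which is exactly how the evaluation in Section 5 proceeds.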
The filter step traverses the index and creates a stream of so-called units (linear pieces of a trajectory) as a superset of the units required to build the query's result. The refinement step processes an ordered stream of units and determines the pieces of units forming the final precise result [9]. Frentzos et al. showed mechanisms to perform NN search on structures such as the R-tree, TB-tree and 3D-R-tree for moving object trajectories; they used depth-first and best-first algorithms in their method [10]. As mentioned, we use the Voronoi diagram [11] to find the kNN in a dynamic space. D. T. Lee used the Voronoi diagram to find the k nearest neighbors. He described a sequential algorithm for computing the order-k Voronoi diagram in O(k²n log n) time and O(k²(N − k)) space [12]. Henning Meyerhenke presented and analyzed a parallel algorithm for constructing the HOVD for two parallel models, PRAM and CGM [14]. In these models he used Lee's iterative approach, but his algorithm takes O(k²(n − k) log n / p) running time and O(k) communication rounds on a CGM
with O(k²(N − k)/p) local memory per processor [14], where p is the number of participating machines.

3. BASIC CONCEPTS AND DEFINITIONS
Let P be a set of n sites (points) in the Euclidean plane. Informally, the Voronoi diagram is a subdivision of the plane into cells (Figure 1), each of whose points has the same closest site [11].

Figure 1. Voronoi diagram

The Euclidean distance between two points p and q is denoted by dist(p, q):

dist(p, q) := \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}    (1)

Definition (Voronoi diagram): Let P = {p_1, p_2, …, p_n} be a set of n distinct points (so-called sites) in the plane. The Voronoi diagram of P is defined as the subdivision of the plane into n cells, one for each site in P, with the characteristic that q lies in the cell corresponding to site p_i if dist(q, p_i) < dist(q, p_j) for each p_j ∈ P with j ≠ i [11].

Historically, O(n²) incremental algorithms for computing the VD were known for many years. Then an O(n log n) algorithm based on divide and conquer was introduced, which was complex and difficult to understand. Later, Steven Fortune [13] proposed a plane sweep algorithm, which provided a simpler O(n log n) solution to the problem. Instead of partitioning the space into regions according to the closest sites, one can also partition it according to the k closest sites, for some 1 ≤ k ≤ n − 1. The diagrams obtained in this way are called higher-order Voronoi diagrams (HOVD), and for a given k, the diagram is called the order-k Voronoi diagram [11]. Note that the order-1 Voronoi diagram is nothing more than the standard VD. The order-(n−1) Voronoi diagram is the farthest-point Voronoi diagram (given a set P of points in the plane, a point of P has a cell in the farthest-point VD if it is a vertex of the convex hull), because the Voronoi cell of a point p_i is now the region of points for which p_i is the farthest site.
Currently the best known algorithms for computing the
order-k Voronoi diagram run in O(n log³ n + nk) time and in O(n log n + nk·2^(c log* k)) time, where c is a constant [11].

Figure 2. Farthest-point Voronoi diagram [11]

Consider x and y as two distinct elements of P. The set of points whose nearest and second nearest neighbors are x and y constitutes a cell in the second-order Voronoi diagram. The second-order Voronoi diagram can be used when we are interested in the two closest points and want a diagram that captures that.

Figure 3. An instance of an HOVD [11]

4. SUGGESTED ALGORITHM
As mentioned before, one of the best algorithms to construct the Voronoi diagram is the Fortune algorithm. Furthermore, the HOVD can be used to find the k nearest neighbors [12]. D. T. Lee used an O(k²n log n) algorithm that constructs a complete HOVD to obtain the nearest neighbors. In D. T. Lee's algorithm, first the first-order Voronoi diagram is obtained, and then the region of the diagram that contains the query point is found. The point in this region is the first neighbor of the query point. In the next step of Lee's algorithm, this nearest point to the query is omitted from the dataset, and the process is repeated; in other words, the Voronoi diagram is built on the rest of the points. In the second repetition of this process, the second neighbor is found, and so on. Thus the nearest neighbors of a given query point are found sequentially.
However, we argue that the nearest neighbors can be found without completing the HOVD construction process. More precisely, in Lee's algorithm, each time after omitting a nearest neighbor, the next order of the Voronoi diagram is built completely (edges and vertices) and the search algorithm is then run to compute the next neighbor. In contrast, in our algorithm only the vertices of the Voronoi diagram are computed, and the neighbors of the query are found during the process of computing the vertices. Thus, in our algorithm, the overhead of edge computation is effectively omitted from finding neighbors. As we show later in this paper, eliminating this superfluous computation yields a more efficient algorithm in terms of time complexity. We use the Fortune algorithm to create the Voronoi diagram; because of space limitations we do not describe this algorithm here, and the respected reader can refer to [11, 13]. As the sweep line moves in the Fortune algorithm, two sets of events emerge: site events and circle events [11]. To find the k nearest neighbors, our algorithm employs the circle events as they develop. There are specific circle events in the algorithm that are not actual circle events, named false alarm circle events. Our algorithm (see the next section) deals efficiently with real circle events and does not superfluously consider the false alarm circle events. A point on the plane is inside a circle when its distance from the center of the circle is less than the radius of the circle. The vertices of a Voronoi diagram are the centers of the circumscribing circles of the triangles formed by triples of points (sites). The main purpose of our algorithm is to find the circles in which the desired query is located.
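The circle-event test just described — a Voronoi vertex is the circumcenter of three sites, and those sites become neighbor candidates when the query falls inside their circle — can be sketched as follows. The function names and the sample triangle are illustrative only; the circumcenter uses the standard closed-form solution.

```python
import math

def circumcircle(a, b, c):
    """Circumcenter and radius of the circle through sites a, b, c.

    A Voronoi vertex is exactly such a circumcenter (d == 0 would
    mean the three sites are collinear)."""
    ax, ay = a; bx, by = b; cx, cy = c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy), math.dist((ux, uy), a)

def query_inside(q, center, r):
    """True when the query lies strictly inside the circle event."""
    return math.dist(q, center) < r

center, r = circumcircle((0, 0), (2, 0), (0, 2))
print(center, r)                        # (1.0, 1.0) and sqrt(2)
print(query_inside((1, 1), center, r))  # True: the three sites join NEARS
print(query_inside((3, 3), center, r))  # False: this circle event is skipped
```

This is precisely the distance(q, o) < r comparison of line 7.i in the pseudocode below, with o the Voronoi vertex and r the circle radius.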
Because the proposed algorithm needs no pre-processing, it is well suited to dynamic environments, where very time-consuming pre-processing overheads cannot be tolerated; as the reader may know, in kNN search methods the larger share of time is dedicated to constructing a data structure (usually in the form of a tree). This algorithm can be efficient especially when there is a large number of points and their motion is considerable.

4.1 HOV-kNN algorithm
Having described our algorithm briefly above, we now elaborate it formally. When the first-order Voronoi diagram is constructed, some of the query's neighbors can be obtained within the complexity of the Fortune algorithm (i.e., O(n log n)). This fact forms the first step of our algorithm. When the circle event discovered in HandleCircleEvent of the Fortune algorithm is real (flagged by the variable "Check" in line 6 of the algorithm; by default the function HandleCircleEvent returns "true" when the circle event is real), the query's distance from the center of the circle is measured. Moreover, when the condition in line 7.i of the algorithm is true, the three points that constitute the circle are added to the NEARS list if they have not been added
before (the function PUSH-TAG(p) shows whether p has already been added to the NEARS list or not).

1) Input: q, a query
2) Output: list NEARS, the k nearest neighbors
3) Procedure:
4) Initialization:
5) NEARS = {} (the k nearest neighbors), Check = false, MOD = 0, V = {} (holds Voronoi points)
6) Check = HandleCircleEvent()
7) If Check = true, then  -- a true circle event is detected
   i) If distance(q, o) < r, then
      (1) If PUSH-TAG(p1) = false, then
          (a) add p1 to NEARS
      (2) If PUSH-TAG(p2) = false, then
          (a) add p2 to NEARS
      (3) If PUSH-TAG(p3) = false, then
          (a) add p3 to NEARS

Real circle events are discovered up to this point and the points that constitute them are added to the query's neighbor list. As pointed out earlier, the desired result is already obtained, with O(n log n) complexity, if the input "k" is less than or equal to the number of neighbors found:

8) If SIZE(NEARS) > k, then
   a. sort(NEARS)  -- sort NEARS by distance to q
   b. for i = 1 to k
      i. print(NEARS[i]);
9) Else if SIZE(NEARS) = k, then
   i. print(NEARS);

The algorithm enters the second step if the conditions of lines 8 and 9 of the first part are not met. The second part computes the vertices of the Voronoi diagram sequentially, so that the obtained vertices are HOV vertices. Under the sequential method for developing the HOVD [12], the vertices of the HOV are obtained by omitting the closer neighbors. Here, however, to find more neighbors sequentially, the closest neighbor and the farthest neighbor are deleted alternately from the set of points. This leads to new circles that encompass the query. Afterward, the same calculations described for part one are carried out on the remaining points (the removed neighbors are recorded in a list named REMOVED_POINTS). The calculations are carried out until the loop condition in line 13 is met.

10) Else if SIZE(NEARS) < k, then
    c. if MOD mod 2 = 0, then
       i.
add nearest_Point to REMOVED_POINTS;
       ii. Remove(P, nearest_Point);
    d. if MOD mod 2 = 1, then
       i. add farthest_Point to REMOVED_POINTS;
       ii. Remove(P, farthest_Point);
11) Increment MOD;
12) Re-run lines 6 to 9 of part 1 on the remaining points P;
13) Repeat until SIZE(NEARS) + SIZE(REMOVED_POINTS) >= k;
14) print(NEARS);

Should the number of neighbors found still be less than the required number, the algorithm starts the third part. In this part, the Voronoi vertices and their distances from the query are recorded in a list. As explained for the first part of the algorithm, the Voronoi vertices produced by the Fortune algorithm and their distances to the query suffice to check the condition of line 7.i, so the vertices and their distances to the query are recorded. The following line is added after line 7 in the first part:

add pair(Voronoi_Vertex, distance_To_Query) to list V

Moreover, along with adding input points to the neighbor list, their distances to the query must be added to the list. Using these two lists (which, once filled, can be ranked by distance to the query), the nearest point or Voronoi vertex is obtainable. The nearest point can then be considered as the input query, and the whole process of the first and second parts of the algorithm is repeated until the required number of neighbors is achieved. Finally, to obtain an even larger number of neighbors, the method can be repeated sequentially over the points closest to the query. This part of the algorithm has the same complexity as the two other parts, as the whole process for the preliminary query is repeated for the representatives of the query.

Figure 4. Implementation of the HOVD

In Figure 4, "o" is a vertex of the Voronoi diagram and the center of a circle event created by p1, p2 and p3. Based on the algorithm, since the circle encompasses the query, the points p1, p2 and p3 are added as neighbors of the query to the neighbors' list.
As k gets close to n, computing higher orders of the Voronoi diagram makes the circles bigger and bigger, so farther and farther neighbors are added to the query's neighbor list.
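The alternating removal that drives part 2 (lines 10-13 of the pseudocode) can be sketched in isolation. This simplified Python stand-in computes distances directly instead of deriving them from circle events, and all names are illustrative:

```python
import math

def alternating_removal(points, q, k):
    """Part-2-style loop: alternately drop the nearest (even MOD) or
    farthest (odd MOD) point from the set, recording each removed
    point, until k candidates have been collected."""
    points = list(points)
    removed = []          # plays the role of REMOVED_POINTS
    mod = 0               # plays the role of MOD
    while points and len(removed) < k:
        key = lambda p: math.dist(p, q)
        victim = min(points, key=key) if mod % 2 == 0 else max(points, key=key)
        removed.append(victim)
        points.remove(victim)
        mod += 1
    return removed

pts = [(1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
print(alternating_removal(pts, (0, 0), 4))  # [(1, 0), (5, 0), (2, 0), (4, 0)]
```

In the actual algorithm these removals are what expose new Voronoi vertices (and hence new encompassing circles) on each pass; the sketch only mirrors the bookkeeping of which point is removed when.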
4.2 The complexity of HOV-kNN
As mentioned before, the HOV-kNN algorithm has a lower time complexity than D. T. Lee's algorithm. To see this, consider the algorithm presented in the previous section. Line 13 states that the main body of the algorithm must be repeated k times, where k is the number of neighbors to be found. In each repetition one of the query's neighbors is detected by the algorithm and subsequently eliminated from the dataset. The principal, and most time-consuming, part of our algorithm lies between lines 6 and 9; it recalls the modified Fortune algorithm, which has time complexity O(n log n). Therefore the overall complexity of our algorithm is:

\sum_{i=1}^{k} O(n \log n) = O(n \log n) \sum_{i=1}^{k} 1 = k \, O(n \log n) = O(kn \log n)    (2)

In comparison to the algorithm introduced in [12] (which has time complexity O(k²n log n)), our algorithm is k times faster. The main reason for this difference is that Lee's algorithm completely computes the HOVD, while ours exploits only a fraction of the HOVD construction process. In terms of space, the complexity of our algorithm is the same as that of the Fortune algorithm: O(n).

5. IMPLEMENTATION AND EVALUATION
This section presents the results of the HOV-kNN algorithm and compares them with another algorithm. The proposed algorithm was implemented in C++; the standard library vector data structure was used to maintain the data points. The input data points used in the program tests were generated randomly. To reach the preferred data distribution, with points neither too close nor too far apart, they were generated under specific conditions.
For instance, for 100 input points the point generation range is 0-100, and for 500 input points the range is 0-500. To ensure accuracy and validity of the output, a simple kNN algorithm was implemented and the outputs of the two algorithms were compared (equal input, equal query). Output evaluation was carried out sequentially, and the outputs were stored in two separate files; afterward, to compare their similarity rate, the two files were used as input to another program. The evaluation was conducted in two steps. First, the parameter k was taken as a constant and the evaluation was performed using different numbers of data points as input. As pictured in Figure 5, the accuracy of the algorithm is more than 90%; in this diagram, the number of inputs in the dataset varies between 10 and 100000. In the second step, the evaluation was conducted with different values of k, while the number of input data points was held constant. The accuracy of the algorithm was 74% for k between 10 and 500 (Figure 6).
Figure 5. The accuracy of the algorithm for constant k and different numbers of input data points

Figure 6. The accuracy of the algorithm for variable k and constant input data

6. CONCLUSION AND FUTURE WORK
We have introduced a new algorithm (named HOV-kNN) with time complexity O(kn log n) that computes the order-k Voronoi diagram to find the k nearest neighbors in a set of N points in Euclidean space. The proposed algorithm finds the k nearest neighbors in two stages: 1) during construction of the first-order Voronoi diagram, some of the query's neighbors are obtained within the complexity of the Fortune algorithm; 2) the vertices of the Voronoi diagram are computed sequentially. Because the pre-processing step is eliminated, this algorithm is particularly suitable for dynamic spaces in which the data points are moving. The experiments were twofold: 1) a constant number of data points with variable k, and 2) a variable number of data points with constant k. The obtained results show that this algorithm has sufficient accuracy to be applied in real situations. In future work we will try to devise a parallel version of our algorithm, implemented efficiently on a parallel machine to obtain more speed. Such an algorithm will be appropriate when the number of input points is massive and the points are possibly distributed over a network of computers.
REFERENCES
[1] Lifshits, Y. Nearest neighbor search: algorithmic perspective, SIGSPATIAL Special, Vol. 2, No. 2, 2010, 12-15.
[2] Shakhnarovich, G., Darrell, T., and Indyk, P. Nearest Neighbor Methods in Learning and Vision: Theory and Practice, The MIT Press, United States, 2005.
[3] Andoni, A. Nearest Neighbor Search - the Old, the New, and the Impossible, Doctor of Philosophy thesis, Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2009.
[4] Bhatia, N., and Ashev, V. Survey of Nearest Neighbor Techniques, International Journal of Computer Science and Information Security, Vol. 8, No. 2, 2010, 1-4.
[5] Dhanabal, S., and Chandramathi, S. A Review of various k-Nearest Neighbor Query Processing Techniques, Computer Applications, Vol. 31, No. 7, 2011, 14-22.
[6] Hajebi, K., Abbasi-Yadkori, Y., Shahbazi, H., and Zhang, H. Fast approximate nearest-neighbor search with k-nearest neighbor graph, In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Vol. 2 (IJCAI'11), Toby Walsh (Ed.), 2011, 1312-1317.
[7] Fukunaga, K., and Narendra, P. M. A Branch and Bound Algorithm for Computing k-Nearest Neighbors, IEEE Transactions on Computers, Vol. 24, No. 7, 1975, 750-753.
[8] Song, Z., and Roussopoulos, N. K-Nearest Neighbor Search for Moving Query Point, In Proceedings of the 7th International Symposium on Advances in Spatial and Temporal Databases (Redondo Beach, California, USA), Springer-Verlag, 2001, 79-96.
[9] Güting, R., Behr, T., and Xu, J. Efficient k-Nearest Neighbor Search on moving object trajectories, The VLDB Journal, Vol. 19, No. 5, 2010, 687-714.
[10] Frentzos, E., Gratsias, K., Pelekis, N., and Theodoridis, Y. Algorithms for Nearest Neighbor Search on Moving Object Trajectories, Geoinformatica, Vol. 11, No. 2, 2007, 159-193.
[11] Berg, M., Cheong, O., Kreveld, M., and Overmars, M. Computational Geometry: Algorithms and Applications, Third Edition, Springer-Verlag, 2008.
[12] Lee, D. T. On k-Nearest Neighbor Voronoi Diagrams in the Plane, IEEE Transactions on Computers, Vol. C-31, No. 6, 1982, 478-487.
[13] Fortune, S. A sweep line algorithm for Voronoi diagrams, Proceedings of the Second Annual Symposium on Computational Geometry, Yorktown Heights, New York, United States, 1986, 313-322.
[14] Meyerhenke, H. Constructing Higher-Order Voronoi Diagrams in Parallel, Proceedings of the 21st European Workshop on Computational Geometry, Eindhoven, The Netherlands, 2005, 123-126.
[15] Gao, Y., Zheng, B., Chen, G., and Li, Q. Algorithms for constrained k-nearest neighbor queries over moving object trajectories, Geoinformatica, Vol. 14, No. 2 (April 2010), 241-276.
This paper may be cited as: Abbasifard, M. R., Naderi, H. and Mirjalili, M., 2014. HOV-kNN: A New Algorithm to Nearest Neighbor Search in Dynamic Space. International Journal of Computer Science and Business Informatics, Vol. 9, No. 1, pp. 12-22.
A Survey on Mobile Malware: A War without End
Sonal Mohite
Sinhgad College of Engineering, Vadgaon, Pune, India.
Prof. R. S. Sonar
Associate Professor, Sinhgad College of Engineering, Vadgaon, Pune, India.
ABSTRACT
Nowadays, mobile devices have become an inseparable part of our everyday lives, and their usage has grown exponentially. With the functionality upgrades of mobile phones, the malware threat for mobile phones is expected to increase. This paper sheds light on when and how mobile malware evolved. The current market shares of mobile operating systems and the number and types of mobile malware are also described. Mobile malware can be propagated via three communication media, viz. SMS/MMS, Bluetooth/Wi-Fi, and FM-RDS. Several mobile malware detection techniques are explained with implemented examples, and when to use each detection technique is clarified along with its pros and cons. First, a static analysis of the application is done, followed by a dynamic analysis; if ample external resources are available, cloud-based analysis is chosen. Application permission analysis and battery life monitoring are novel approaches to malware detection. Along with malware detection, preventing mobile malware has become critical: proactive and reactive techniques of mobile malware control are defined and explained, and a few tips are provided to restrain malware propagation. Finally, a structured and comprehensive overview of the research on mobile malware is presented.
Keywords
Mobile malware, malware propagation, malware control, malware detection.
1. INTRODUCTION
Decades ago, computers were the only traditional devices used for computing. Now, smart phones are used as supporting computing devices alongside computers.
With the increasing capabilities of such phones, malware, which was once the biggest threat to computers, has become widespread on smart phones too. The damage done by mobile malware includes theft of confidential data from the device, eavesdropping on ongoing conversations by a third party, extra charges incurred by sending SMS to premium-rate numbers, and even location-based tracking of the user, which is too severe to overlook. So there is a pressing need to understand the propagation means of mobile malware, the various techniques to detect it, and ways to restrain it.
2. RELATED WORKS
Malware is a malicious piece of software designed to damage a computer system and interrupt its normal working. Fundamentally, malware is a short form of "malicious software". Mobile malware is malicious software targeting mobile phones instead of traditional computer systems. With the evolution of mobile phones, mobile malware started its evolution too [1-4]. When the propagation medium is taken into account, mobile viruses are of three types: Bluetooth-based viruses, SMS-based viruses, and FM RDS-based viruses [5-9]. A BT-based virus propagates through Bluetooth and Wi-Fi and has a regional impact [5], [7], [8]. In contrast, an SMS-based virus follows a long-range spreading pattern and can be propagated through SMS and MMS [5], [6], [8]. An FM RDS-based virus uses the RDS channel of an FM radio transmitter for propagation [9]. Our work addresses the effect of the operational behavior of users and the mobility of devices on virus propagation. There are several methods of malware detection, viz. the static method, the dynamic method, cloud-based detection, battery life monitoring, application permission analysis, enforcing a hardware sandbox, etc. [10-18]. In addition to the work given in [10-18], our work addresses the pros and cons of each malware detection method. Along with the study of virus propagation and detection mechanisms, methods of restraining virus propagation are also vital. A number of proactive and reactive malware control strategies are given in [5], [10].
3. EVOLUTION OF MOBILE MALWARE
Although the first mobile malware, 'Liberty Crack', was developed in the year 2000, mobile malware evolved rapidly during the years 2004 to 2006 [1]. An enormous variety of malicious programs targeting mobile devices evolved during this period and is still evolving today.
These programs were similar to the malware that targeted traditional computer systems: viruses, worms, and Trojans, the latter including spyware, backdoors, and adware. At the end of 2012, there were 46,445 mobile malware modifications. By the end of June 2013, however, Kaspersky Lab had added an aggregate total of 100,386 mobile malware modifications to its system [2], and the total number of mobile malware samples at the end of December 2013 was 148,778 [4]. Moreover, Kaspersky Lab [4] has collected 8,260,509 unique malware installation packs. This shows a dramatic increase in mobile malware. The arrival of 'Cabir', the second mobile malware (a worm developed in 2004 for Symbian OS), confirmed the basic rule of computer virus evolution. Three conditions need to be fulfilled for malicious programs to target any particular operating system or platform:
 The platform must be popular: During the evolution of 'Cabir', Symbian was the most popular platform for smart phones. Nowadays, however, it is Android that is most targeted by attackers. Malware authors continue to concentrate on the Android platform as it holds 93.94% of the total market share of mobile phones and tablet devices.
 There must be well-documented development tools for the application: Nowadays, every mobile operating system developer provides a software development kit and precise documentation, which helps in easy application development.
 The presence of vulnerabilities or coding errors: During the evolution of 'Cabir', Symbian had a number of loopholes, which was the reason for malware intrusion. Today, the same thing applies to Android [3].
The market share of an operating system plays a crucial role in mobile malware development: the higher the market share of an operating system, the higher the possibility of malware infection. The pie chart below illustrates the distribution of mobile malware by operating system (platform) [4]:
Figure 1. OS wise malware distribution
4. MOBILE MALWARE PROPAGATION
There are three communication channels through which malware can propagate: SMS/MMS, Bluetooth/Wi-Fi, and FM radio broadcasts.
4.1 SMS / MMS
Viruses that use SMS as a communication medium can send copies of themselves to all phones recorded in the victim's address book. The virus can spread by means of forwarded photos, videos, short text messages, etc. Propagation follows a long-range spreading pattern analogous to the spreading of computer viruses, such as worm propagation in e-mail networks [6]. For an accurate study of SMS-based virus propagation, one needs to consider certain operational patterns, such as whether or not users open a virus attachment. Hence, the operational behavior of users plays a vital role in SMS-based virus propagation [8].
4.1.1 Process of malware propagation
If a phone is infected with an SMS-based virus, the virus regularly sends copies of itself to other phones whose contact numbers are found in the contact list of the infected phone. After receiving such a distrustful message, the recipient may open or delete it according to his alertness. If the user opens the message, he is infected. But if a phone is immunized with antivirus software, a newly arrived virus won't be propagated even if the user opens an infected message. Therefore, the security awareness of mobile users plays a key role in SMS-based virus propagation. The same process applies to MMS-based virus propagation, although MMS carries a more sophisticated payload than SMS: it can carry video and audio in addition to the simple text and picture payload of SMS.
4.2 Bluetooth/ Wi-Fi
Viruses that use Bluetooth as a communication channel are local-contact-driven viruses, since they infect other phones within their short radio range.
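The SMS-based propagation process of section 4.1.1 can be sketched as a simple simulation; the open probability, the immunized set, and the contact lists below are illustrative assumptions, not parameters from the surveyed models.

```python
import random

def simulate_sms_virus(contacts, p_open, immunized, seed_phone, rng):
    """Spread an SMS virus over contact lists; a phone becomes infected when
    its user opens the message and the phone is not immunized with antivirus."""
    infected = {seed_phone}
    frontier = [seed_phone]
    while frontier:
        phone = frontier.pop()
        for neighbor in contacts.get(phone, []):
            if neighbor in infected or neighbor in immunized:
                continue  # already infected, or antivirus blocks the virus
            if rng.random() < p_open:  # user's alertness: opens the message?
                infected.add(neighbor)
                frontier.append(neighbor)
    return infected

contacts = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
rng = random.Random(42)
print(simulate_sms_virus(contacts, p_open=1.0, immunized={"C"}, seed_phone="A", rng=rng))
```

With `p_open=1.0` and phone "C" immunized, the virus reaches "A", "B", and "D" but never "C" — matching the text's point that both user alertness (`p_open`) and antivirus immunization limit SMS-based spread.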
A BT-based virus infects individuals that are homogeneous with the sender, and each of them has an equal probability of contact with the others [7]. Mobility characteristics of the user, such as whether or not a user moves at a given hour, the probability of returning to visited places at the next time step, the traveling distances of a user at the next time step, etc., need to be considered [8].
4.2.1 Process of malware propagation
Unlike SMS-based viruses, if a phone is infected by a BT-based virus, it spontaneously and automatically searches for other phones through the available Bluetooth services. The BT-based virus replicates to devices within the radio range of the sending device. For that reason, users' mobility patterns and contact
frequency among mobile phones play crucial roles in BT-based virus propagation. The same process is followed for Wi-Fi, where Wi-Fi can carry a higher payload over a larger range than BT.
4.3 FM-RDS
Several existing electronic devices do not support data connectivity but include an FM radio receiver; such devices include low-end mobile phones, media players, vehicular audio systems, etc. FM provides the FM Radio Data System (RDS), a low-rate digital broadcast channel. It is intended for delivering simple information about the station and the current program, but it can also be used by a broad range of new applications and to enhance existing ones [9].
4.3.1 Process of malware propagation
The attacker can attack in two different ways. The first way is to create a seemingly benign app and upload it to popular app stores. Once the user downloads and installs the app, it contacts an update server and updates its functionality. The newly added malicious functionality decodes and assembles the payload; finally, the assembled payload is executed by the Trojan app to escalate privileges on the attacked device and use it for malicious purposes. The other way is for the attacker to obtain a privilege escalation exploit for the desired target. As the RDS protocol has limited bandwidth, the exploit needs to be packetized. Packetization breaks up a multi-kilobyte binary payload into several smaller Base64-encoded packets; sequence numbers are attached for proper reception of the data at the receiver side. The received exploit is executed, and in this way the device is infected with malware [9].
5. MOBILE MALWARE DETECTION TECHNIQUES
Once malware has propagated, malware detection needs to be carried out. In this section, various mobile malware detection techniques are explained.
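The packetization step described in section 4.3.1 — splitting a multi-kilobyte payload into small Base64 packets with sequence numbers — can be sketched as follows. The packet format (a 4-digit sequence-number prefix) is a hypothetical illustration, not the format used in [9].

```python
import base64

def packetize(payload: bytes, chunk_size: int):
    """Split a binary payload into small Base64 packets with sequence numbers,
    as required by a low-bandwidth broadcast channel such as RDS."""
    chunks = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    return [f"{seq:04d}:{base64.b64encode(c).decode()}" for seq, c in enumerate(chunks)]

def reassemble(packets):
    """Order packets by sequence number and decode back to the original bytes."""
    ordered = sorted(packets, key=lambda p: int(p.split(":", 1)[0]))
    return b"".join(base64.b64decode(p.split(":", 1)[1]) for p in ordered)

payload = b"multi-kilobyte exploit binary (stand-in)"
packets = packetize(payload, chunk_size=8)
assert reassemble(reversed(list(packets))) == payload  # order-insensitive
```

The sequence numbers are what let the receiver reassemble the payload even when packets arrive out of order over the broadcast channel.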
5.1 Static Analysis Technique
As the name indicates, static analysis evaluates the application without executing it [10-11]. It is an economical as well as fast approach to detect any malevolent characteristics in an application without executing it. Static analysis can be used for static pre-checks that are performed before the application is admitted to online application markets. Such application markets are available for most major smartphone platforms, e.g. 'Play Store' for Android and 'Store' for the Windows operating system. These extended pre-checks enhance the malware detection probability, and therefore further spreading of malware in the online application stores can be prevented. In static analysis, the application is investigated for apparent security threats like memory corruption flaws, bad code segments, etc. [10], [12].
5.1.1 Process of malware detection
If the source code of the application is available, static analysis tools can be used directly for further examination of the code. If the source code is not available, the executable app is converted back into source code; this process is known as disassembling. Once the application is disassembled, feature extraction is done. Feature extraction means observing certain parameters, viz. system calls, data flow, control flow, etc. Depending on the observations, anomalies are detected, and the application is categorized as either benign or malicious.
Pros: An economical and fast approach to malware detection.
Cons: Source code of applications is not readily available, and disassembling might not yield the exact source code.
Figure 2. Static Analysis Technique
5.1.2 Example
Figure 2 shows the malware detection technique proposed by Enck et al. [12] for Android. The application's installation image (.apk) is used as input to the system. Ded, a Dalvik decompiler, is used to disassemble the code. It
generates Java source code from the .apk image. Feature extraction is done using Fortify SCA, a static code analysis suite that provides four types of analysis: control flow analysis, data flow analysis, structural analysis, and semantic analysis. It is used to evaluate the recovered source code and categorize the application as either benign or malicious.
5.2 Dynamic Analysis Technique
Dynamic analysis comprises analyzing the actions performed by an application while it is being executed. In dynamic analysis, the mobile application is executed in an isolated environment such as a virtual machine or emulator, and the dynamic behavior of the application is monitored [10], [11], [13]. There are various methodologies for performing dynamic analysis, viz. function call monitoring, function parameter analysis, information flow tracking, instruction tracing, etc. [13].
5.2.1 Process of malware detection
The dynamic analysis process is quite different from static analysis. Here, the application is installed in a standard emulator. After installation, the app is executed for a specific time and penetrated with random user inputs. Using the various methodologies mentioned in [13], the application is examined, and based on its runtime behavior it is classified as either benign or malicious.
Pros: A comprehensive approach to malware detection; most malware gets detected by this technique.
Cons: Comparatively complex and requires more resources.
Figure 3. Dynamic Analysis Technique
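The final step of dynamic analysis as described above — condensing a runtime log into a vector that a classifier can use — can be sketched as below. The system-call watchlist and the threshold rule are illustrative assumptions, not part of any surveyed system.

```python
from collections import Counter

SUSPICIOUS = {"sendto", "open", "execve"}  # illustrative syscall watchlist

def log_to_vector(syscall_log, vocabulary):
    """Condense a sequence of logged system calls into a count vector."""
    counts = Counter(syscall_log)
    return [counts[name] for name in vocabulary]

def classify(vector, vocabulary, threshold=5):
    """Toy rule: flag the run if suspicious calls dominate the trace."""
    score = sum(v for name, v in zip(vocabulary, vector) if name in SUSPICIOUS)
    return "malicious" if score > threshold else "benign"

vocab = ["read", "open", "sendto", "execve"]
log = ["sendto"] * 6 + ["read", "open"]
vec = log_to_vector(log, vocab)
print(vec, classify(vec, vocab))  # [1, 1, 6, 0] malicious
```

A real system such as AASandbox would feed a much richer vector (and a trained model rather than a fixed threshold), but the log-to-vector condensation step is the same idea.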
5.2.2 Example
Figure 3 shows the Android Application Sandbox (AASandbox) [14], a dynamic malware detection technique proposed by Blasing et al. for Android. It is a two-step analysis process comprising both static and dynamic analysis. AASandbox first implements a static pre-check, followed by a comprehensive dynamic analysis. In the static analysis, the application image binary is disassembled, and the disassembled code is used for feature extraction and to search for any distrustful patterns. After the static analysis, dynamic analysis is performed: the binary is installed and executed in the AASandbox, 'Android Monkey' is used to generate runtime inputs, and system calls are logged to log files. The generated log file is then summarized and condensed into a mathematical vector for better analysis. In this way, the application is classified as either benign or malicious.
5.3 Cloud-based Analysis Technique
Mobile devices possess limited battery and computational resources. With such constrained resource availability, it is quite problematic to deploy a full-fledged security mechanism in a smartphone. As data volume increases, it is more efficient to move security mechanisms to an external server rather than increase the working load of the mobile device [10], [15].
5.3.1 Process of malware detection
In the cloud-based method of malware detection, all security computations are moved to a cloud that hosts several replicas of the mobile phones running on emulators, and the results are sent back to the mobile device. This increases the performance of mobile devices.
Pros: The cloud holds ample resources of every type, which helps in more comprehensive malware detection.
Cons: Extra charges to maintain the cloud and forward data to the cloud server.
5.3.2 Example
Figure 4 shows Paranoid Android (PA), proposed by Portokalidis et al. [15].
Here, security analysis and computations are moved to a cloud (remote server). PA consists of two modules, a tracer and a replayer. A tracer is located in each smart phone; it records all the information necessary to repeat the execution of the mobile application on the remote server. The information recorded by the tracer is first filtered and encoded, then stored properly, and the synchronized data is sent to the replayer over an encrypted channel. The replayer is located in the cloud; it holds the replica of the mobile phone running on an emulator and records the information communicated by the tracer. The replayer replays the same execution on the emulator, in the
cloud. The cloud, as the remote server, owns abundant resources for performing multifarious analyses on the data collected from the tracer. During the replay, numerous security analyses such as dynamic malware analysis, memory scanning, system call tracing, and call graph analysis [15] are performed; in fact, there is no limit on the number of attack detection techniques that can be applied in parallel.
Figure 4. Cloud-based Detection Technique
5.4 Monitoring Battery Consumption
Monitoring battery life is a completely different approach to malware detection compared to the other techniques. Smartphones usually possess limited battery capacity, which needs to be used judiciously. The usual user behavior, existing battery state, signal strength, and network traffic details of a mobile phone are recorded over time, and this data can be effectively used to detect hidden malicious activities. By observing current energy consumption, malicious applications can indeed be detected, as they are expected to consume more power than normal regular usage. At the same time, battery power consumption is one of the major limitations of mobile phones and limits the complexity of anti-malware solutions. Quite remarkable work has been done in this field; the introductory exploration in this domain was done by Jacoby and Davis [16].
5.4.1 Process of malware detection
After infection, a greedy malware keeps replicating itself. If the means of propagation is Bluetooth, then the device continuously scans for
  • 35. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 9, No. 1. JANUARY 2014 32 adjacent Bluetooth-enabled devices which in turn consume a remarkable amount of power. This time-domain data of power consumption collected over a period of time is transformed into frequency-domain data & represented as dominant frequencies. The malwares are identified from these certain dominant frequencies. Pros: Economical and novel approach of malware detection. Cons: Because of multi-functionality of smart phones, power consumption model of smart phone could not be accurately defined. 5.4.2 Example Recent work by Liu et al. [17] proposed another detection technique by comparing the compressed sequences of the power consumption value in each time interval. They defined a user-centric power model that relies on user actions. User actions such as duration & frequency of calls, number of SMS, network usage are taken into account. Their work uses machine learning techniques to generate rules for malware detection. 5.5 Application Permission Analysis With the advancements in mobile phone technology, users have started downloading third party application. These applications are available in third party application stores. While developing any application, application developers need to take required permissions from device in order to make the application work on that device. Permissions hold a crucial role in mobile application development as they convey the intents and back-end activities of the application to the user. Permissions should be precisely defined & displayed to the user before the application is installed. Though, some application developers hide certain permissions from user & make the application vulnerable & malicious application. 5.5.1 Process of malware detection Security configuration of an application is extracted. Permissions taken by an application are analyzed. 
If the application has taken any unwanted permissions, it is categorized as malicious.
Pros: Fewer resources are required compared to the other techniques.
Cons: Analyzing only the permission requests is not adequate for mobile malware detection; it needs to be done in parallel with static and/or dynamic analysis.
5.5.2 Example
Kirin, proposed by Enck et al. (2009) [18], is an application certification system for Android. During installation, Kirin checks the application permissions. It extracts the security configurations of the application
and checks them against templates, i.e. security policy rules already defined by Kirin. If an application fails to pass all the security policy rules, Kirin either deletes the application or alerts the user for assistance [18].
6. MOBILE MALWARE CONTROL STRATEGIES
Basically, there are two types of malware control strategies, viz. proactive and reactive control. In the proactive malware control strategy, malware is mitigated before its propagation; a proper set of preventive measures is used for this purpose. In the reactive malware control strategy, malware first propagates and a reaction is then taken upon malware contamination.
6.1 Proactive Malware Control Strategy
Here are some of the proactive malware control techniques given in [10]; in all of them, the users' own security awareness plays a crucial role.
 Install a decent mobile security application, i.e. an antivirus.
 Always download apps from trusted official application markets. Before downloading any app, read its reviews and ratings. During installation, always remember to read the permissions requested by the app, and if anything appears doubtful, don't install it. Always keep installed apps up to date.
 Turn off Wi-Fi, Bluetooth, and other short-range wireless communication media when they are not in use. Be more cautious when connecting to insecure public Wi-Fi networks and when accepting Bluetooth data from unknown senders.
 When confidential data is to be stored on the mobile phone, encrypt it before storing and set a password for access. Do regular back-ups. Ensure that sensitive information is not cached locally on the mobile phone.
 Always keep an eye on the battery life and on SMS and call charges; if any out-of-the-ordinary behavior is found, perform an in-depth check of the recently installed applications.
 During internet access, don't click on links that seem suspicious or untrustworthy.
 Finally, in case of mobile phone theft, delete all contacts, applications, and confidential data remotely.
6.2 Reactive Malware Control Strategy
The working principle of the reactive malware control strategy is that the control measure is implemented once the malware is detected. An antivirus solution as such comes under proactive malware control; however, when a new
malware is found, the antivirus updates for that malware that are implemented and forwarded to mobile phones are a part of reactive malware control. This is known as adaptive patch dissemination.
Adaptive Patch Dissemination
A pre-immunization, such as an antivirus, is used to protect networks before virus propagation. In reality, however, we first detect certain viruses and then update the antivirus with so-called patches; these patches are forwarded into networks only after the viruses have already propagated. Network bandwidth limits the speed with which security notifications or patches can be sent to all users simultaneously. Therefore, a new strategy, namely the adaptive dissemination strategy, was developed. It is based on the Autonomy Oriented Computing (AOC) methodology, which helps send security notifications or patches to most phones at a relatively low communication cost. AOC is used to search for a set of highly connected phones with large communication abilities in a mobile network [5].
7. CONCLUSION
Rapid growth in smart phone development has resulted in the evolution of mobile malware. Operating system market share plays a crucial role in malware evolution. SMS/MMS is the fastest means of mobile malware propagation, as it has no geographical boundary like BT/Wi-Fi; FM-RDS is still evolving. Among all malware detection techniques, static malware detection is performed first, during pre-checks. Dynamic analysis is performed later and can be combined with application permission analysis. Cloud-based analysis is a more comprehensive approach, as it uses external resources to perform malware detection and can perform more than one type of analysis simultaneously. The proactive control strategy is used to control malware before its propagation, while the reactive control strategy is used after malware has propagated.
REFERENCES
[1] La Polla, M., Martinelli, F., and Sgandurra, D. (2012). A survey on security for mobile devices. IEEE Communications Surveys & Tutorials, 15(1), 446-471.
[2] Kaspersky Lab IT Threat Evolution: Q2 2013. (2013). Retrieved from http://www.kaspersky.co.in/about/news/virus/2013/kaspersky_lab_it_threat_evolution_q2_2013.
[3] Kaspersky Security Bulletin 2013: Overall statistics for 2013. (2013, December). Retrieved from http://www.securelist.com/en/analysis/204792318/Kaspersky_Security_Bulletin_2013_Overall_statistics_for_2013.
[4] Maslennikov, D. (2013, February). Mobile Malware Evolution: Part 6. Retrieved from http://www.securelist.com/en/analysis/204792283/Mobile_Malware_Evolution_Part_6.
[5] Gao, C., and Liu, J. (2013). Modeling and restraining mobile virus propagation. IEEE Transactions on Mobile Computing, 12(3), 529-541.
[6] Gao, C., and Liu, J. (2011). Network immunization and virus propagation in Email networks: Experimental evaluation and analysis. Knowledge and Information Systems, 27(2), 253-279.
[7] Yan, G., and Eidenbenz, S. (2009, March). Modeling propagation dynamics of Bluetooth worms (extended version). IEEE Transactions on Mobile Computing, 8(3), 353-368.
[8] Gonzalez, M., Hidalgo, C., and Barabasi, A. (2008). Understanding individual human mobility patterns. Nature, 453(7196), 779-782.
[9] Fernandes, E., Crispo, B., and Conti, M. (2013, June). FM 99.9, Radio virus: Exploiting FM radio broadcasts for malware deployment. IEEE Transactions on Information Forensics and Security, 8(6), 1027-1037.
[10] Chandramohan, M., and Tan, H. (2012). Detection of mobile malware in the wild. IEEE Computer, 45(9), 65-71.
[11] Yan, Q., Li, Y., Li, T., and Deng, R. (2009). Insights into malware detection and prevention on mobile phones. Springer-Verlag Berlin Heidelberg, SecTech 2009, 242-249.
[12] Enck, W., Octeau, D., McDaniel, P., and Chaudhuri, S. (2011, August). A study of Android application security. The 20th USENIX Security Symposium.
[13] Egele, M., Scholte, T., Kirda, E., and Kruegel, C. (2012, February). A survey on automated dynamic malware-analysis techniques and tools. ACM Computing Surveys, 44(2), Article 6.
[14] Blasing, T., Batyuk, L., Schmidt, A., Camtepe, S., and Albayrak, S. (2010). An Android application sandbox system for suspicious software detection. 5th International Conference on Malicious and Unwanted Software.
[15] Portokalidis, G., Homburg, P., Anagnostakis, K., and Bos, H. (2010, December). Paranoid Android: Versatile protection for smartphones. ACSAC'10.
[16] Jacoby, G. (2004). Battery-based intrusion detection. The Global Telecommunications Conference.
[17] Liu, L., Yan, G., Zhang, X., and Chen, S. (2009). VirusMeter: Preventing your cellphone from spies. RAID, 5758, 244-264.
[18] Enck, W., Ongtang, M., and McDaniel, P. (2009, November). On lightweight mobile phone application certification. 16th ACM Conference on Computer and Communications Security.
This paper may be cited as: Mohite, S. and Sonar, R. S., 2014. A Survey on Mobile Malware: A War without End. International Journal of Computer Science and Business Informatics, Vol. 9, No. 1, pp. 23-35.
An Efficient Design Tool to Detect Inconsistencies in UML Design Models

Mythili Thirugnanam
Assistant Professor (Senior)
School of Computing Science and Engineering
VIT University, Vellore, Tamil Nadu

Sumathy Subramaniam
Assistant Professor (SG)
School of Information Technology and Engineering
VIT University, Vellore, Tamil Nadu

ABSTRACT
The quality of any software developed is evaluated based on its design. Design is one of the most important phases in the software life cycle, and poor design leads to a high failure rate of the software. Various traditional and UML models are widely used to design software, and many tools are available to build UML models as per user requirements. However, these tools do not support validation of UML models, which ultimately leads to design errors. Most existing testing tools check the consistency of UML models; some check for inconsistencies, i.e., violations of the consistency rules required for UML models. The proposed work aims to develop an efficient tool that detects inconsistencies in a given UML model. Parsing techniques are applied to extract the XML tags. The extracted tags contain relevant details, such as the class names, attribute names, operation names and associations with their corresponding names in the class diagram, in the meta-model format. On applying the consistency rules to the given input UML model, inconsistencies are detected and a report is generated. From the inconsistency report, the error efficiency and design efficiency are computed.

Keywords
Software Design, Unified Modeling Language (UML), Testing, Extensible Markup Language (XML).

1.
INTRODUCTION
In the present-day scenario, software development is moving towards high-level design, which raises new research issues and scope for developing new sets of tools that support design specification. Most research in software specification uses verification and validation techniques to prove correctness in terms of certain properties. The delivery of a high-quality software product is a major goal in software engineering, and an important aspect is achieving an error-free product that assures the quality of the software. Inspection and testing are common verification and validation (V&V) approaches for defect detection in the software development process. Existing statistical data show that the cost of finding and repairing software bugs rises drastically in later development stages. The Unified
Modeling Language (UML) is now widely accepted as the standard modeling language for software construction. The class diagram in its core view provides the backbone for any modeling effort and has well-formed semantics.

2. BACKGROUND STUDY
Alexander Egyed [4, 5] presents an automated approach for detecting and tracking inconsistencies in real time and for automatically identifying changes in various models that affect the consistency rules. The approach observes the behavior of consistency rules to understand how they affect the model. Techniques for efficiently detecting inconsistencies in UML models and identifying the changes required to fix them are analyzed. The work describes a technique for automatically generating a set of concrete changes for fixing inconsistencies and providing information about the impact of each change on all consistency rules. The approach is integrated with the design tool IBM Rational Rose. Muhammad Usman [9] presents a survey of UML consistency checking techniques, analyzing various parameters and constructing an analysis table. The analysis table helps evaluate existing consistency checking techniques and leads to the conclusion that most approaches validate intra- and inter-level consistencies between UML models using a monitoring strategy. UML class, sequence, and statechart diagrams are used in most of the existing consistency checking techniques. Alexander Egyed [3] demonstrates that a tool can assist the designer in discovering unintentional side effects, locating choices for fixing inconsistencies, and then changing the design model. The paper examines the impact of changes on UML design models [10] and explores a methodology to discover the negative side effects of design changes and to predict the positive and negative impact of the available choices.
Alexander Egyed [1, 2] presents an approach for quickly, correctly, and automatically deciding which consistency rules need to be evaluated when a model changes. The approach does not require consistency rules with special annotations. Instead, it treats consistency rules as black-box entities and observes their behavior during evaluation to identify the different types of model elements they access. Christian Nentwich [6, 7] presents a repair framework for inconsistent distributed documents that generates interactive repairs from the full first-order logic formulae that constrain the documents. A full implementation of the components, as well as their application to UML and related heterogeneous documents such as EJB deployment descriptors, is presented. This approach can be used as an infrastructure for building highly domain-specific frameworks. Researchers have focused on removing
inconsistencies in a few UML models. The work proposed in [11] attempts to address and detect inconsistencies in UML models such as class diagrams, use case diagrams, sequence diagrams and so on. A survey exploring the impact of model-driven software development is given in [12]; change impact analysis, consistency management, uncertainty management, and inconsistency detection and resolution rules are dealt with in that work.

3. FRAMEWORK OF THE PROPOSED WORK
Figure 1. Framework of the proposed work
[Figure 1 depicts the pipeline: select UML model → convert UML model into XML file → extract the XML tags → apply parsing technique → apply consistency rules → detect inconsistency in the given input → generate the inconsistency report.]

4. DETAILED DESCRIPTION OF THE PROPOSED WORK
The framework of the proposed work is given in Figure 1.

4.1. Converting the UML model into an XML file
A UML design diagram does not by itself support inconsistency detection; detecting inconsistencies directly on the diagram is practically impossible, so the UML model is converted into an XML file. UML models such as use case diagrams, class diagrams and sequence diagrams can be taken as input for this tool. The final output of this module is an XML file, which is used further to detect inconsistencies. A snapshot of reading the input file is shown in Figure 2.
Procedure used:
 Convert the chosen input design into an XML file: in the VP-UML project, select the input file and export it as an XML file
 Select the diagram that needs to be exported
 Select the location where the exported file is to be stored

The input file is read from the user to carry out the further process (Figure 2); here, a use case diagram is read as the input file. The input diagram is stored as an XML file and passed as the input to the next process, which extracts the XML tags.

4.2. Extracting the XML tags and applying the parsing technique
From the XML file, the XML tags are extracted. The parsing technique is applied on the XML tags to identify the related information of the given model, which is in the meta-model format [3]. For example, in a class diagram, the class names, attributes and methods are identified. All the related information of the given input model is extracted.

Procedure used:
 Open the XML file
 Copy the file as a text file
 Split the tags into tokens and extract the relevant information about the diagram
 Save the extracted result in a file

Figures 3 and 4 illustrate the above procedure. The XML file is the input for this step; the method adopts the tokenizer concept to split the tags and store them.

4.3. Detecting the design inconsistency
The consistency rules [8, 10] are applied on the related information of the given input design diagram to detect inconsistencies. Related information that does not satisfy a rule constitutes a design inconsistency for the given input model. All possible inconsistencies are detected as described below. Figure 5 shows the inconsistencies in a given use case diagram.

4.3.1. Consistency rules for the class diagram
 Visibility of a member should be given.
 Visibility of all attributes should be private.
 Visibility of all methods should be public.
 Associations should have a cardinality relationship.
 When one class depends on another class, there should be a class interface notation.
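As an illustration of how an extracted XML model can be checked against the class-diagram visibility rules above, the following sketch applies two of them to a parsed document. This is not the authors' tool: the tag names `Class`, `Attribute`, `Operation` and the `visibility` attribute are hypothetical stand-ins for the actual VP-UML export schema.

```python
# Sketch only: tag/attribute names are assumptions, not the real VP-UML schema.
import xml.etree.ElementTree as ET

SAMPLE = """
<Model>
  <Class name="Account">
    <Attribute name="balance" visibility="private"/>
    <Attribute name="owner" visibility="public"/>
    <Operation name="deposit" visibility="public"/>
    <Operation name="audit"/>
  </Class>
</Model>
"""

def check_class_rules(xml_text):
    """Return human-readable messages for rule violations found in the model."""
    issues = []
    root = ET.fromstring(xml_text)
    for cls in root.iter("Class"):
        cname = cls.get("name", "?")
        # Rule: visibility of a member should be given; attributes private.
        for attr in cls.iter("Attribute"):
            vis = attr.get("visibility")
            if vis is None:
                issues.append(f"{cname}.{attr.get('name')}: visibility missing")
            elif vis != "private":
                issues.append(f"{cname}.{attr.get('name')}: attribute should be private")
        # Rule: visibility of a member should be given; methods public.
        for op in cls.iter("Operation"):
            vis = op.get("visibility")
            if vis is None:
                issues.append(f"{cname}.{op.get('name')}(): visibility missing")
            elif vis != "public":
                issues.append(f"{cname}.{op.get('name')}(): method should be public")
    return issues

if __name__ == "__main__":
    for msg in check_class_rules(SAMPLE):
        print(msg)
```

Collecting the violations into a list, rather than failing on the first one, mirrors the paper's idea of producing a collective inconsistency report.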
4.3.2. Consistency rules for the use case diagram
 Every actor should have at least one relationship with a use case.
 The system boundary should be defined.
 Words that suggest incompleteness, such as "some" and "etc.", should be removed.

4.3.3. Consistency rules for the sequence diagram
 All objects should have at least one interaction with some other object.
 Proper parameters should be included for each message.

Procedure used:
 Select the input design model.
 Based on the chosen design model (class diagram, use case diagram or sequence diagram), the extracted result is compared with the corresponding consistency rules and inconsistencies are detected.

4.4. Generating the inconsistency report
A collective report is generated for all the inconsistencies detected in the given input model. The report provides the overall inconsistency of the given input model, which is taken care of during implementation.

4.5. Computing design efficiency
The total number of possible errors in the design model is estimated [10]. Then the total number of errors found in the input design model is determined with the procedures discussed. The error efficiency is computed using Equation 1, and from the calculated error efficiency, the design efficiency is computed using Equation 2. The implementation is shown in Figure 6. [eq 1] [eq 2]

5. RESULTS & DISCUSSION
In the recent past there has been a blossoming of new approaches in software design and testing. The proposed system primarily aims to detect inconsistencies and thereby provide an efficient design specification. Though much research is being carried out on detecting inconsistencies in various UML models, not much work has addressed use case and class diagrams. The developed system does not have
any constraint on the maximum number of lines of code. This added feature makes the tool more versatile compared with existing tools. Various design models for different scenarios were taken as samples and tested for consistency. The results obtained show that the developed tool was able to detect all the inconsistencies present in the given input models.

Figure 2. Selecting the input model (the chosen UML model is a use case design)
Figure 3. Snapshot of the XML format file extracted from the input UML model
Figure 4. Snapshot of the relevant information obtained from the given design's XML file
Figure 5. Snapshot of the inconsistency details for the given input design
Figure 6. Snapshot of the efficiency of the given input design model

6. CONCLUSION AND FUTURE ENHANCEMENT
Inspection and testing of software are important approaches in software engineering practice that aim to reduce the number of defects in software products. Software inspection focuses on design specifications in the early phases of software development, whereas traditional testing approaches focus on the implementation phases or later. Software inspection is widely regarded as an effective defect-finding technique, and recent research has considered tool support as a means to increase its efficiency. A variety of faults can be found during design model construction and validation, and testing at an early phase in the software life cycle not only increases quality but also reduces the cost incurred. The developed tool can help enforce the inspection process, provide support for finding defects in the design model, and also compute the design efficiency by deriving the error efficiency. This work takes care of the major constraints imposed while creating design models such as class diagrams, use case diagrams and sequence diagrams. A further enhancement of the proposed work is to address the other major constraints in class diagrams, such as inheritance, association, cardinality constraints and so on.

REFERENCES
[1] A. Egyed and D. S. Wile, Support for Managing Design-Time Decisions, IEEE Transactions on Software Engineering, 2006.
[2] A. Egyed, Fixing Inconsistencies in UML Design Models, ICSE, 2007.
[3] A. Egyed, Instant Consistency Checking for the UML, Proceedings of the International Conference on Software Engineering, 2006.
[4] A. Egyed, E. Letier and A. Finkelstein, Generating and Evaluating Choices for Fixing Inconsistencies in UML Design Models, International Conference on Software Engineering, 2008.
[5] A. Egyed, Automatically Detecting and Tracking Inconsistencies in Software Design Models, IEEE Transactions on Software Engineering, ISSN: 0098-5589, 2009.
[6] C. Nentwich, L. Capra and A. Finkelstein, xlinkit: a consistency checking and smart link generation service, ACM Transactions on Internet Technology, 2002.
[7] C. Nentwich, W. Emmerich and A. Finkelstein, Consistency Management with Repair Actions, ICSE, 2003.
[8] D. Kalibatiene, O. Vasilecas and R. Dubauskaite, Ensuring Consistency in Different IS Models – UML Case Study, Baltic J. Modern Computing, Vol. 1, No. 1-2, pp. 63-76, 2013.
[9] M. Usman, A. Nadeem, T. Kim and E. Cho, A Survey of Consistency Checking Techniques for UML Models, Advanced Software Engineering & Its Applications, 2008.
[10] R. Dubauskaite and O. Vasilecas, Method on Specifying Consistency Rules among Different Aspect Models, Expressed in UML, Elektronika ir Elektrotechnika, ISSN 1392-1215, Vol. 19, No. 3, 2013.
[11] J. Rumbaugh, I. Jacobson and G. Booch, The Unified Modeling Language Reference Manual, Addison-Wesley, 1999.
[12] A. Khalil and J. Dingel, Supporting the Evolution of UML Models in Model Driven Software Development: A Survey, Technical Report, School of Computing, Queen's University, Canada, February 2013.

This paper may be cited as: Thirugnanam, M. and Subramaniam, S., 2014. An Efficient Design Tool to Detect Inconsistencies in UML Design Models. International Journal of Computer Science and Business Informatics, Vol. 9, No. 1, pp. 36-44.
An Integrated Procedure for Resolving Portfolio Optimization Problems using Data Envelopment Analysis, Ant Colony Optimization and Gene Expression Programming

Chih-Ming Hsu
Minghsin University of Science and Technology
1 Hsin-Hsing Road, Hsin-Fong, Hsinchu 304, Taiwan, ROC

ABSTRACT
The portfolio optimization problem is an important issue in the field of investment/financial decision-making and is currently receiving considerable attention from both researchers and practitioners. In this study, an integrated procedure using data envelopment analysis (DEA), ant colony optimization (ACO) for continuous domains and gene expression programming (GEP) is proposed. The procedure is evaluated through a case study on investing in stocks in the semiconductor sub-section of the Taiwan stock market. The potential average six-month return on investment of 13.12% from November 1, 2007 to July 8, 2011 indicates that the proposed procedure can be considered a feasible and effective tool for making outstanding investment plans. Moreover, it is a strategy that can help investors make profits even when the overall stock market suffers a loss. The present study can help an investor rapidly screen the stocks with the most profitable potential and can automatically determine the optimal investment proportion of each stock to minimize the investment risk while satisfying the target return on investment set by the investor. Furthermore, this study fills a gap in the literature concerning the timing of buying/selling stocks by providing a set of transaction rules.

Keywords
Portfolio optimization, Data envelopment analysis, Ant colony optimization, Gene expression programming.

1.
INTRODUCTION
Portfolio optimization is a procedure that aims to find the optimal percentage asset allocation for a finite set of assets, thus giving the highest return for the least risk. It is an important issue in the field of investment/financial decision-making and is currently receiving considerable attention from both researchers and practitioners. The first parametric model applied to the portfolio optimization problem was proposed by Harry M. Markowitz [1]: the Markowitz mean-variance model, which is the foundation of modern portfolio theory. The non-negativity constraint makes the standard Markowitz model NP-hard and inhibits an analytic
solution. Although quadratic programming can be used to solve the problem with a reasonably small number of different assets, it becomes much more difficult if the number of assets is increased or if additional constraints, such as cardinality constraints, bounding constraints or other real-world requirements, are introduced. Therefore, various approaches for tackling portfolio optimization problems using heuristic techniques have been proposed. For example, Anagnostopoulos and Mamanis [2] formulated portfolio selection as a tri-objective optimization problem that aims to simultaneously maximize the expected return while minimizing both the risk and the number of assets held in the portfolio. In addition, their proposed model considered quantity constraints and class constraints intended to limit the proportion of the portfolio invested in assets with common characteristics and to avoid very small holdings. The experimental results and a comparison revealed that SPEA2 (strength Pareto evolutionary algorithm 2) [4] is the best algorithm for both the constrained and unconstrained portfolio optimization problems, while PESA (Pareto envelope-based selection algorithm) [3] is the runner-up and the fastest of all the approaches compared. Deng and Lin [5] proposed an approach for resolving the cardinality-constrained Markowitz mean-variance portfolio optimization problem based on the ant colony optimization (ACO) algorithm. Their method was demonstrated using test data from the Hang Seng 31, DAX 100, FTSE 100, S&P 100 and Nikkei 225 indices from March 1992 to September 1997, which yielded adequate results.
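The mean-variance objective that these heuristics attack can be illustrated with a toy sketch: portfolio return is the weighted sum of asset returns, portfolio risk is the quadratic form w'Σw, and the non-negativity and budget constraints define the feasible simplex. The three assets, their expected returns and covariances below are made-up illustrative numbers, and the coarse grid search is only a stand-in for the quadratic-programming or heuristic solvers discussed in the text.

```python
# Toy mean-variance sketch; asset data are hypothetical, not from the paper.
from itertools import product

mu = [0.10, 0.07, 0.03]          # expected asset returns (assumed)
sigma = [[0.09, 0.01, 0.00],     # covariance matrix (assumed)
         [0.01, 0.04, 0.00],
         [0.00, 0.00, 0.01]]

def portfolio_return(w):
    """Expected portfolio return w . mu."""
    return sum(wi * mi for wi, mi in zip(w, mu))

def portfolio_variance(w):
    """Portfolio risk w' Sigma w."""
    return sum(w[i] * sigma[i][j] * w[j]
               for i in range(len(w)) for j in range(len(w)))

def min_variance_portfolio(target, step=0.05):
    """Grid-search the simplex {w >= 0, sum(w) = 1} for the least-risk
    weights whose expected return meets `target`. The non-negativity
    constraint is what rules out a closed-form answer in general."""
    n = round(1 / step)
    best = None
    for i, j in product(range(n + 1), repeat=2):
        if i + j <= n:
            w = (i * step, j * step, (n - i - j) * step)
            if portfolio_return(w) >= target:
                if best is None or portfolio_variance(w) < portfolio_variance(best):
                    best = w
    return best

if __name__ == "__main__":
    w = min_variance_portfolio(target=0.06)
    print(w, portfolio_return(w), portfolio_variance(w))
```

With cardinality or bounding constraints added, the feasible region becomes non-convex, which is precisely why the ACO, PSO and GNP heuristics surveyed here are used instead of exact solvers.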
Chen et al. [6] proposed a decision-making model of dynamic portfolio optimization for adapting to changes in stock prices, based on time-adapting genetic network programming (TA-GNP), to generate portfolio investment advice. They determined the distribution of the initial capital to each brand in the portfolio and created trading rules for buying and selling stocks on a regular basis, using technical indices and candlestick charts as judgment functions. The effectiveness and efficiency of their proposed method was demonstrated by an experiment on the Japanese stock market; the comparative results clarified that TA-GNP generates more profit than the traditional static GNP, genetic algorithms (GAs), and the Buy & Hold method. Sun et al. [7] modified the update equations for the velocity and position of the particle in particle swarm optimization (PSO) and proposed drift particle swarm optimization (DPSO) to resolve the multi-stage portfolio optimization (MSPO) problem, where transactions take place at discrete time points during the planning horizon. The authors illustrated their approach by conducting experiments on the problem with different numbers of stages in the planning horizon, using sample data collected from the S&P 100 index. The experimental results and a comparison indicated that the DPSO heuristic can yield superior efficient frontiers compared to PSO, GAs and two classical