International Journal of
Applied Sciences & Engineering
www.ijapscengr.com ISSN 2308-5088 editor@ijapscengr.com
RESEARCH ARTICLE
Efficient Data Filtering Algorithm for Big Data Technology in Telecommunication Network
James Agajo1, Nosiri Onyebuchi2, Okhaifoh J3 and Godwin Okpe4
1School of Engineering, Federal University of Technology, Minna, Nigeria; 2Department of Electrical/Electronic, Federal University of Technology, Owerri, Nigeria; 3Federal University of Petroleum Resources, Effurun, Delta State, Nigeria; 4Programmes Department, National Board for Technical Education, Kaduna, Nigeria
ARTICLE INFO
Received: February 12, 2015
Revised: March 20, 2015
Accepted: June 22, 2015

ABSTRACT
Efficient data filtering for Big Data technology in telecommunication is a concept aimed at effectively filtering desired information for preventive purposes. The challenges posed by the unprecedented rise in the volume, variety and velocity of information have necessitated the exploration of various methods. Big Data, which simply means data sets so large and complex that traditional data processing tools and technologies cannot cope with them, is considered here. A process for examining such data to uncover hidden patterns was evolved by developing an algorithm comprising several stages: artificial neural network, backtracking, depth-first search, branch and bound, dynamic programming and error checking. The algorithm developed gave rise to a flowchart, with each block representing a sub-algorithm.
Key words:
Big data
Filtering
Variability
Velocity
Volume
*Corresponding Address:
James Agajo
james.agajo@futminna.edu.ng
Cite This Article as: Agajo J, N Onyebuchi, J Okhaifoh and G Okpe, 2015. Efficient data filtering algorithm for big
data technology in telecommunication network. Inter J Appl Sci Engr, 3(1): 19-26. www.ijapscengr.com
INTRODUCTION
Background
Since the emergence of satellite technology, the world has experienced a massive explosion of data transfer from one point to another, with an endless demand for increased bandwidth to accommodate some very pressing necessities like:
i. Facebook
ii. Twitter
iii. Skype
iv. Voice calls
v. Video calls
vi. Google Plus
vii. Flickr
viii. LinkedIn
ix. Amazon
x. etc.
All of these add to the volume of data transferred between source and sink. This large chunk of data has become too much for present-day conventional tools to manage; therefore Big Data has become necessary.
Big data is a broad term for data sets so large or complex that traditional data processing applications cannot handle them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. [1]
Accuracy in big data may lead to more confident decision making, and better decisions can mean greater operational efficiency, cost reductions and reduced risk (Big Data, 2015). Big Data simply means data sets that are so large and complex that traditional data processing tools and technologies cannot cope with them. The process of examining such data to uncover hidden patterns in them is referred to as Big Data Analytics (Christopher S., 2014). There is much to gain from exploiting big data in telecommunications: it increases revenue and reduces customer churn and operating costs.
Communications service providers all over the globe
are seeing an unprecedented rise in volume, variety and
velocity of information ("big data") due to next generation
mobile network rollouts, increased use of smart phones
and rise of social media (Keith C.C. Chan, 2013).
Service providers who can tackle the big data
challenge will differentiate from competitors, gain market
share and increase revenue and profits with innovative
new services. The essence of this paper is to come up with
ways of effectively filtering desired information for
preventive purposes, in areas such as:
i. Gang/drug offences
ii. Violent crime
iii. Cybercrime
iv. Advance-fee fraud
v. Cultism, etc.
Crime intelligence extension
A robust e-crime detection technique will lower risk, detect fraud and monitor cyber security in real time. Big Data will augment and enhance cyber security and intelligence analysis platforms with big data technologies to process and analyze new types (e.g. social media, emails, sensors, telco) and sources of under-leveraged data, significantly improving intelligence, security and law enforcement insight. Data could be filtered with the aid of keywords.
A keyword is a word that is reserved by a program because the word has a special meaning; keywords can be commands or parameters. [5]
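As a minimal illustration of keyword-based filtering (a hedged sketch, not the authors' implementation; the watch-list and messages are hypothetical), a stream of messages can be screened in Python as follows:

    # Hypothetical sketch: flag messages that contain any watch-list keyword.
    KEYWORDS = {"fraud", "ransom", "exploit"}   # assumed watch-list, illustration only

    def matches_keywords(message: str) -> bool:
        # True if the message shares at least one word with the watch-list.
        return not KEYWORDS.isdisjoint(message.lower().split())

    def filter_stream(messages):
        # Yield only the messages worth passing on for further analysis.
        for msg in messages:
            if matches_keywords(msg):
                yield msg

    # Example: only the second message is retained.
    flagged = list(filter_stream(["hello world", "wire the ransom now"]))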
Instead of chasing and finding criminals on the street, Big Data provides a medium whereby certain characters can be detected and arrested within a telecommunication network with the aid of data.
Successfully harnessing big data can help service
providers achieve three critical objectives for
telecommunications transformation: Deliver smarter
services that generate new sources of revenue; Transform
operations to achieve surveillance and service excellence;
and Build smarter networks to drive consistent, high-
quality data capture.
Data mining, in contrast, refers to the activity of going through a collection of data with a view to looking for relevant or pertinent information. This type of activity is a good example of the old axiom "looking for a needle in a haystack." The idea is that businesses collect massive sets of data that may be homogeneous or automatically collected. Decision-makers need access to smaller, more specific pieces of data from those large sets. Data mining can help uncover pieces of information that will inform transparency and help chart the course for business. Data mining is also relevant in cloud computing (Amir Gandomi, 2015).
Data mining
Data mining can involve the use of different kinds of
software packages such as analytics tools. It can be
automated, or it can be largely labor-intensive, where
individual workers send specific queries for information
to an archive or database. Generally, data mining refers to
operations that involve relatively sophisticated search
operations that return targeted and specific results.
Cloud computing
Cloud computing is a general term for anything that involves delivering hosted services over the Internet, as shown in figures 1 and 1.1. Cloud computing could also be among the factors behind the increase in data volume. The hosted services are broadly divided into three categories: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS).
For big data, we see cloud computing as a veritable tool for enhancing data filtering techniques. Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured, and big data may be as important to business, and society, as the Internet has become. Why? More data may lead to more accurate analyses, more accurate search may lead to more confident results, and better decisions can mean greater operational efficiencies, cost reductions and reduced risk. Analysis of data sets can find new correlations, to "spot business trends, prevent diseases, combat crime and so on." [1] Scientists, practitioners of media and advertising, and governments alike regularly meet difficulties with large data sets in areas including Internet search, finance and business informatics. Scientists also encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. [3]
Figure 1: Network of data transmission
Fig. 1.1: Cloud computing
Fig. 1.3: Data range
Fig. 2: Current Cisco Visual Networking Index
Big data is characterised by 3 V's:
Volume: Many factors contribute to the increase in data volume: transaction-based data stored through the years, unstructured data streaming in from social media, and increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.
Velocity: Data is streaming in at unprecedented speed and
must be dealt with in a timely manner. RFID tags, sensors
and smart metering are driving the need to deal with
torrents of data in near-real time. Reacting quickly enough
to deal with data velocity is a challenge for most
organizations.
Variety: Data today comes in all types of formats: structured, numeric data in traditional databases; information created from line-of-business applications; and unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with.
Variability: In addition to the increasing velocities and
varieties of data, data flows can be highly inconsistent
with periodic peaks. Is something trending in social
media? Daily, seasonal and event-triggered peak data
loads can be challenging to manage. Even more so with
unstructured data involved.
Complexity: Today's data comes from multiple sources.
And it is still an undertaking to link, match, cleanse and
transform data across systems. However, it is necessary to
connect and correlate relationships, hierarchies and
multiple data linkages or your data can quickly spiral out
of control.
Data sets grow in size in part because they are increasingly being gathered by cheap and numerous information-sensing mobile devices, aerial (remote sensing) platforms, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, 2.5 exabytes (2.5×10^18 bytes) of data were created every day. [9] The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.
The real issue is not that you are acquiring large amounts of data; it is what you do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable:
1) Cost reductions
2) Time reductions
3) New product development and optimized offerings
4) Smarter business decision making.
For instance, by combining big data and high-powered analytics, it is possible to optimize routes for many thousands of package delivery data in the course of transmission.
IP traffic
The current Cisco Visual Networking Index (VNI) forecast projects global IP traffic to nearly triple from 2014 to 2019. See Appendix A for a detailed summary. Overall IP traffic is expected to grow to 168 exabytes per month by 2019, up from 59.9 exabytes per month in 2014, a CAGR of 23 percent, as shown in figure 2.
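As a quick sanity check (a hedged sketch; the figures are taken from the paragraph above, with the 23 percent CAGR compounded over the five years 2014-2019), the quoted numbers are mutually consistent:

    # Verify: 59.9 EB/month growing at 23% CAGR for 5 years is about 168 EB/month.
    base_2014 = 59.9                 # exabytes per month in 2014
    cagr = 0.23                      # compound annual growth rate
    years = 5                        # 2014 -> 2019
    projected_2019 = base_2014 * (1 + cagr) ** years
    print(round(projected_2019, 1))  # -> 168.6, matching the forecast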
The New York Times provides the following data on log files:
• 50 GB of uncompressed log files
• 10 GB of compressed log files
• 0.5 GB of processed log files
• 50-100M clicks
• 4-6M unique users
• 7,000 unique pages with more than 100 hits
• Index size: 2 GB
• Pre-processing & indexing time: ~10 min on a workstation (4 cores & 32 GB)
• ~1 hour on EC2 (2 cores & 16 GB)
Fig. 3: Intercontinental network for data route transmission 1. Source: Marko Grobelnik, Jozef Stefan Institute (2012).
Fig. 4: Intercontinental network for data route transmission 2. Source: Marko Grobelnik, Jozef Stefan Institute (2012).
Fig. 5: Branch of distributed data sensing devices
The diagrams in figures 3 and 4 are optimised route diagrams for communication channels through which data is transmitted. It is estimated that in one minute on the Internet 640 terabytes of data are transferred, 100 thousand tweets are posted, and 204 million e-mails are sent.
MATERIALS AND METHODS
This work develops an algorithm which filters specified data out of a large chunk of data. It involves searching for correct data within a particular search space and applying the principle of backtracking. The data to be searched could be keywords numbering more than a million, so large and complex. By adopting depth-first search (DFS), the bank of data is first subjected to a data search with the aid of distributed data sensing devices (software) which cover the large pool of data; the sensing devices are distributed within the search space of the pool of data. Note that Big Data simply means data sets so large and complex that traditional data processing tools and technologies cannot cope with them. We adopt the backtracking procedure to realise the first round of data filtering within the search space.
To be explicit, the following methods will be used for this work:
1. Neural Network
2. Backtracking Algorithm
3. Depth First Search (DFS)
4. Branch and Bound
5. Dynamic Programming
6. Error Detection
Branch and bound
The Branch and Bound method will be used to distribute the search space among the data sensing nodes; this is called the distributed data sensing method.
The distributed sensing method will be guided by the principle of branch and bound, a design paradigm for discrete and combinatorial optimization problems. A branch-and-bound algorithm consists of a systematic enumeration of candidate data solutions by means of state space search: the set of candidate solutions is thought of as forming a rooted tree with the full set at the root. The algorithm explores branches of this tree, which represent subsets of the solution set. The branches are well described by the diagram shown in figure 5.
The diagram shows the branches tagged to the various distributed data sensing systems. Since the Big Data pool is quite large and complex, adopting the bounding method becomes necessary; this will specifically assign sensing boundaries to the data sensing devices (Miller et al., 2012).
S = {s1, s2, …, sn} …… (1)

where S stands for the big data bank. Each data sensing device has a specific boundary it can cover, which is represented using subset statements:

Si ⊂ S, Sj ⊂ S, Sk ⊂ S, SL ⊂ S

These relations describe the bound each data detecting sensor can cover. When the devices sense the data bank, note that not all the data is useful; the system is made to get rid of the unwanted data by using the principle of backtracking.
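A minimal sketch of the bounding idea just described, assuming the data bank is an indexable list (the partition scheme and sensor count are illustrative assumptions, not the authors' implementation):

    # Hypothetical sketch: assign each sensing device a bounded slice of the pool S.
    def bound_search_space(pool, num_sensors):
        # Partition pool S into num_sensors contiguous bounded subsets S_i.
        step = -(-len(pool) // num_sensors)          # ceiling division
        return [pool[i:i + step] for i in range(0, len(pool), step)]

    S = list(range(20))                 # stand-in for the big data bank
    subsets = bound_search_space(S, 4)  # bounded subsets Si, Sj, Sk, SL
    # Each sensor scans only its own bounded subset, in parallel in principle.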
The backtracking algorithm enumerates a set of partial results (data) that, in principle, could be completed in various ways to give all the possible solutions to the given problem. The completion is done incrementally, by a sequence of candidate extension steps.
Conceptually, the partial results are represented as the nodes of a tree structure, the potential search tree. Each partial candidate is the parent of the candidates that differ from it by a single extension step; the leaves of the tree are the partial results that cannot be extended any further.
The backtracking algorithm traverses this search
tree recursively, from the root down, in depth-first order.
At each node c, the algorithm checks whether c can be
completed to a valid solution.
If it cannot, the whole sub-tree rooted at c is skipped.
Otherwise, the algorithm (1) checks whether c itself is a
valid solution, and if so reports it to the admin and (2)
recursively enumerates all sub-trees of c. Therefore,
the actual search tree that is traversed by the algorithm is
only a part of the potential tree. The total cost of the
algorithm is the number of nodes of the actual tree times
the cost of obtaining and processing each node. This fact
should be considered when choosing the potential search
tree and implementing the skipping test.
The first source sensing node starts the sensing process by picking up particular keywords it can handle; this is done by all the sensors at the source. But since the data bank is so large, accuracy becomes a key issue, and therefore an artificial neural network becomes necessary. A neural network is set up to further filter the information so as to get the right kind of data we are looking for. Each source sensing device will relay its signal to the next sensing node for further filtering; the information to be sensed is X with a weight level W.
Design methodology
This work adopts the neural network application using the forward propagation method. The sensing nodes are referred to as R; this includes the source node and the intermediary nodes, while F is the final sink node, where all data converges and is collected after filtering and processing.
We come up with a design where the data is a function of weight w and input value x:
X1 – W1
X2 – W2
⋮
Xn – Wn

A forward propagation technique will require a transition for the neural set for inter-node communication, considering the values of X as inputs and W as weights. We therefore express the process of data transmission between nodes as:

X11W1 – X12W2 – X13W3 – … – X1nWn
X21W1 – X22W2 – X23W3 – … – X2nWn
X31W1 – X32W2 – X33W3 – … – X3nWn
⋮
Xm1W1 – Xm2W2 – Xm3W3 – … – XmnWn
In this work we refer to the reduced functional device (RFD) as the source and intermediary node, which is the input, and the full functional device (FFD) as the sink. We therefore substitute the input X in the neural set with R, while FFD is substituted for F:

R = RFD
F = FFD

We refer to data from an RFD as R, i.e. X = R:

R11W1 – R12W2 – R13W3 – … – R1nWn
R21W1 – R22W2 – R23W3 – … – R2nWn
R31W1 – R32W2 – R33W3 – … – R3nWn
⋮
Rm1W1 – Rm2W2 – Rm3W3 – … – RmnWn
Fig. 6: Neural Network Matrix for WSN
The neural set above directly substitutes R for X; the matrix represents the solution for the equation in figure 6. The matrix in the equation can then be expressed as logistic regression in figure 7; note that this model is only implementable when the addressing mode is perfect for a linear routine process.
The neural network will look like this:
Fig. 7: Recurrent Neural Network Model for WSN
A feed-forward pass: the first column is the input R with its corresponding weight W, followed by the repeater node, which is referred to as the intermediate node; these could also be termed hidden layers, which range between l = 1 and L. This can be represented with a black box.
This is logistic regression; it has the same structure as the linear model, where we have the inputs combined linearly using weights, summed up into a signal which passes through a soft threshold.
The eventual equation becomes:

F(R) = FNN(y(R−1), y(R−2), y(R−3), …, y(R−n)) …… (2)
Fig. 8: Black box for hidden layers (intermediary node)
Sj = signal; Wij = weight
Fig. 9: Layer representation in NN Model
The idea is to evolve an equation to compute the data packet error between a transmitting node and a receiving node. Data loss occurs when one or more packets of data travelling across the nodes fail to reach their destination. The equation draws a relationship between the source node and the sink node:

F(R) = … (3)
A linear model is proposed, which is logistic regression, where we have the inputs combined linearly using weights which sum up to a signal. This model is meant to implement a genuine probability; the error measure will be based on the likelihood measure given below:

∏n P(Fn | Rn)

P is the probability of error and is maximised between output F and input R, from which the error measure is derived, where:
L = layers
d = dimension
j = output
Note that w refers to data weight.
Rj(i) = θ(Sj(i)) = θ( Σk wkj(i) Rk(i−1) )

The operation of the above equation shows how the new value of input R is obtained from the previous R; the process is recursive, with

θ(s) = tanh(s)
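Under this reconstruction, one forward-propagation step can be sketched as follows (a generic feed-forward pass with θ = tanh; the layer sizes and weights are made-up assumptions, not the authors' code):

    import math

    def forward_layer(R_prev, W):
        # One forward-propagation step: R_j = tanh(sum_k W[k][j] * R_prev[k]).
        n_out = len(W[0])
        signals = [sum(W[k][j] * R_prev[k] for k in range(len(R_prev)))
                   for j in range(n_out)]
        return [math.tanh(s) for s in signals]

    # Toy step: 3 sensing inputs relayed to 2 intermediate nodes.
    R = [0.5, -1.0, 0.25]
    W = [[0.1, -0.3], [0.4, 0.2], [-0.5, 0.7]]   # arbitrary weights W[k][j]
    hidden = forward_layer(R, W)                 # values relayed onward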
All weights are represented by w = {wij(l)}. The error between transmitter and receiver can be defined as:

e(w) = (dij − Rij)² …… (4)

where, for equation (4), i = row and j = column; R stands for the receiving and transmitting node, and d is the range between transmitted and received data.
To find the error measurement, we can evaluate ∂e(w)/∂wij(l) one by one, analytically or numerically:

∇e(w) = [ ∂e(w)/∂wij(l) ]

A trick for efficient computation is the chain rule. To solve for the gradient we apply the chain rule:

∂e(w)/∂wij(l) = ∂e(w)/∂sj(l) × ∂sj(l)/∂wij(l)

This is the estimated error between two nodes.
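For a single link this chain-rule factorisation can be checked numerically; the sketch below uses a one-input, one-output unit with squared error and invented values for d and the input r (illustrative assumptions, not values from the paper):

    import math

    d, r = 0.8, 0.5                    # assumed target and input for one link

    def error(w):
        # e(w) = (d - tanh(w * r))^2, squared error for a single tanh unit.
        return (d - math.tanh(w * r)) ** 2

    def analytic_grad(w):
        # Chain rule: de/dw = (de/ds) * (ds/dw), with signal s = w * r.
        s = w * r
        de_ds = -2.0 * (d - math.tanh(s)) * (1.0 - math.tanh(s) ** 2)
        return de_ds * r               # ds/dw = r

    w, h = 1.2, 1e-6
    numeric = (error(w + h) - error(w - h)) / (2 * h)   # numerical estimate
    assert abs(numeric - analytic_grad(w)) < 1e-6       # both derivations agree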
The diagram is an illustration of the transmission process; in between are intermediary nodes, though not shown (Sj = signal, Wij = weight).
For every data transmission between two sensors, we can compute the error in the transmitted data as it is routed from layer L(i=1) to L(i=n); layer L represents the various layers of inter-sensor nodes between the source and the sink.
To compute the data error for sensors that are interlinked wirelessly from the source node through intermediary nodes to the sink node, we advance the formula in equation (3) using the same chain-rule decomposition:

∂e(w)/∂wij(l) = ∂e(w)/∂sj(l) × ∂sj(l)/∂wij(l)
Implementation
This work adopts a backtracking method to realise data recovery and also to ensure that filtering is accurate. P stands for the pool of data and the root; c stands for the data to be retrieved. The backtracking algorithm will take the following form.
In order to apply backtracking to a specific class of problems, one must provide the data P for the particular instance of the problem that is to be solved, and six procedural parameters: root, reject, accept, first, next, and output. These procedures should take the instance data P as a parameter and should do the following:
root(P): return the partial candidate at the root of the
search tree.
reject(P,c): return true only if the partial candidate c is not
worth completing.
accept(P,c): return true if c is a solution of P,
and false otherwise.
first(P,c): generate the first extension of candidate c.
next(P,s): generate the next alternative extension of a
candidate, after the extension s.
output(P,c): use the solution c of P, as appropriate to the
application.
The backtracking algorithm then reduces to the call bt(root(P)), where bt is the following recursive procedure:

procedure bt(c):
    if reject(P, c) then return
    if accept(P, c) then output(P, c)
    s ← first(P, c)
    while s ≠ Λ do
        bt(s)
        s ← next(P, s)
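A runnable Python rendering of this skeleton is sketched below, instantiated for a small keyword-filtering example (the candidate encoding, pool and acceptance rule are hypothetical, since the paper leaves the six procedures abstract; Λ is rendered as None):

    # Sketch: enumerate keyword sequences of length max_len from a small pool P.
    P = {"pool": ["fraud", "ransom", "botnet"], "max_len": 2}

    def root(P): return []
    def reject(P, c): return len(c) > P["max_len"]    # prune overlong candidates
    def accept(P, c): return len(c) == P["max_len"]   # full-length candidates
    def output(P, c): print(c)

    def first(P, c):
        return c + [P["pool"][0]]                     # first extension of c

    def next_alt(P, s):
        i = P["pool"].index(s[-1]) + 1                # next alternative after s
        return None if i == len(P["pool"]) else s[:-1] + [P["pool"][i]]

    def bt(c):
        if reject(P, c): return
        if accept(P, c): output(P, c)
        s = first(P, c)
        while s is not None:
            bt(s)
            s = next_alt(P, s)

    bt(root(P))   # prints all 9 two-keyword candidate sequences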
Fig. 10: Map of intersection delay layers (George Hadley, 1960)
Fig. 11: Compact representation of the network (Ronald A. Howard, 1962)
Dynamic programming
Dynamic programming is an optimization approach that transforms a complex problem into a sequence of simpler problems; its essential characteristic is the multistage nature of the optimization procedure. More so than the optimization techniques described previously, dynamic programming provides a general framework for analyzing many problem types.
Within this framework a variety of optimization techniques can be employed to solve particular aspects of a more general formulation. Usually creativity is required before we can recognize that a particular problem can be cast effectively as a dynamic program, and often subtle insights are necessary to restructure the formulation so that it can be solved effectively.
Our first decision (from right to left) occurs with one stage, or intersection, left to go. If, for example, we are in the intersection corresponding to the highlighted box in Fig. 11, we incur a delay of three microseconds in this intersection and a delay of either eight or two microseconds in the last intersection, depending upon whether we move up or down. Therefore, the smallest possible delay, or optimal solution, in this intersection is 3 + 2 = 5 microseconds. Similarly, we can consider each intersection (box) in this column in turn and compute the smallest total delay as a result of being in each intersection. The solution is given by the bold-faced numbers in Fig. 10. The arrows indicate the optimal solution, up or down, in any intersection with one stage, or one intersection, to go.
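The backwards recursion just described can be sketched directly. The grid shape below assumes each stage has one more intersection than the one before it (an illustrative assumption); the delay values reproduce the 3 + min(8, 2) = 5 microsecond step from Fig. 11:

    # Minimal-delay route through staged intersections (dynamic programming).
    # stages[t][i] = delay at intersection i of stage t; values are invented.
    stages = [[3], [8, 2]]

    def min_delay(stages):
        # Work right to left: from intersection i you may continue "up" to
        # intersection i or "down" to intersection i + 1 of the next stage.
        cost = stages[-1][:]                       # last stage: own delay only
        for t in range(len(stages) - 2, -1, -1):
            cost = [stages[t][i] + min(cost[i], cost[i + 1])
                    for i in range(len(stages[t]))]
        return cost[0]

    print(min_delay(stages))                       # -> 5, i.e. 3 + min(8, 2)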
Algorithm
The algorithm for realising this work is as follows:
i. Start
ii. Branch and Bound
iii. Backtracking
iv. Depth First Search (DFS)
v. Dynamic Programming
vi. Artificial Neural Network
vii. Error Check
viii. Stop
Note that each line in the algorithm is a sub-algorithm in its own right, which the pages of this write-up cannot fully contain; however, the flowchart in figure 12 summarises the entire procedure, and a sketch of the stages chained as a pipeline follows the flowchart.
Fig. 12: Flowchart
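To make the flow of the flowchart concrete, the stages can be read as a pipeline in which each block consumes the previous block's output. The stage functions below are identity stubs standing in for the sub-algorithms described above (hypothetical glue code, not the authors' implementation):

    # Sketch of the flowchart as a pipeline; each stub stands for a sub-algorithm.
    def branch_and_bound(data):     return data   # partition the search space
    def backtrack_filter(data):     return data   # discard unwanted data
    def depth_first_search(data):   return data   # traverse the candidate tree
    def dynamic_programming(data):  return data   # optimise routing delay
    def neural_refine(data):        return data   # ANN-based refinement
    def error_check(data):          return data   # verify transmission integrity

    STAGES = [branch_and_bound, backtrack_filter, depth_first_search,
              dynamic_programming, neural_refine, error_check]

    def run_pipeline(data):
        for stage in STAGES:
            data = stage(data)
        return data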
The general constraint satisfaction problem consists in finding a list of integers x = (x[1], x[2], ..., x[n]), each in some range {1, 2, ..., m}, that satisfies some arbitrary constraint (Boolean function) F.
For this class of problems, the instance data P would be the integers m and n, and the predicate F. In a typical backtracking solution to this problem, one could define a partial candidate as a list of integers c = (c[1], c[2], ..., c[k]), for any k between 0 and n, that are to be assigned to the first k variables x[1], x[2], ..., x[k]. The root candidate would then be the empty list ().
The first and next procedures would then be:

function first(P, c):
    k ← length(c)
    if k = n then return Λ
    else return (c[1], c[2], ..., c[k], 1)

function next(P, s):
    k ← length(s)
    if s[k] = m then return Λ
    else return (s[1], s[2], ..., s[k−1], 1 + s[k])

Here "length(c)" is the number of elements in the list c.
Conclusion
With the failing tendencies arising from the inability of existing systems to carry out effective data filtering for event monitoring, the combination of the seven blocks of sub-algorithms in the flowchart is a way of advancing the use of Big Data to sniff a network and extract precise information from a large chunk of data for predictive and analytic purposes.
Predictive analytics, which comprises a variety of techniques that predict future outcomes based on historical and current data, will help predict and forecast events. Filtering can be achieved with the aid of the tools presented, which could be used for crime prediction and reduction and much more.
REFERENCES
Big Data, 2015. http://en.wikipedia.org/wiki, retrieved 11 June 2015.
Chan KCC, 2013. Big Data Analytics for Drug Discovery. IEEE International Conference on Bioinformatics and Biomedicine.
Gandomi A and M Haider, 2015. Beyond the hype: Big data concepts, methods, and analytics.
Grobelnik M, 2012. Jozef Stefan Institute, Ljubljana, Slovenia.
Hadley G, 1960. Nonlinear and Dynamic Programming. Addison-Wesley. Exercise 10 is based on Section 10.6.
Howard RA, 1962. Dynamic Programming and Markov Processes. John Wiley & Sons. Exercise 24.
Miller KCS, S Lucas, L Irakliotis, M Ruppa, T Carlson and B Perlowitz, 2012. "Demystifying Big Data: A Practical Guide to Transforming the Business of Government". Washington: Tech America Foundation.
Surdak C, 2014. Data Crush. Retrieved 14 February 2014.