Information Retrieval using Dynamic Indexing
Sura I. Mohammed
Computer science department
Faculty of computer science
Cairo University
Egypt, Cairo
suib200684@yahoo.com
Hussien M. Sharaf
ITC department
Arab Open University
Egypt, Cairo
hussiensharaf@from-masr.com
Fatma A. Omara
Computer science department
Faculty of computer science
Cairo University
Egypt, Cairo
f.omara@fci-cu.edu.eg
Abstract—Since the demand for information retrieval increases
quickly, indexing structures have become essential for supporting fast
information retrieval. This paper introduces a new data structure for
information retrieval called the Dynamic Ordered Multi-field Index
(DOMI). It is based on radix trees organized in segments, together with
a hash table that points to the roots of each segment, where each
segment is dedicated to storing the values of a single field. The hash
table is used to access the needed segments directly without traversing
the upper segments, so DOMI improves look-up performance for queries
that address a single field. For queries that address multiple fields,
each relevant segment of the radix tree is traversed sequentially
without visiting unrelated branches. Segmentation also gives the
proposed DOMI flexibility for minimizing communication overhead in a
distributed system: every field in the radix tree is represented by one
segment, and each segment can be stored as one block.
In addition, the proposed DOMI consumes less space than indexes built
using B or B+ trees. Hence, it is more suitable for data-intensive
workloads such as Big Data.
Keywords: Dynamic Order Multi-field Index (DOMI), Data query,
indexing structures, Big Data.
I. INTRODUCTION
As data sets grow larger, research has focused on providing
flexible and efficient query mechanisms for extracting data from a
set of fields in streaming data. Several existing data structures,
such as B and B+ trees, can be used for this task. In particular,
index mechanisms are needed to retrieve information quickly
according to its location within large data sets; the main
objective of indexing is to optimize query speed [22].
Supporting dynamic queries over a large data set involves two
major phases. The first phase prepares the data sets. The second
phase answers a query by using an index over the data produced
by the first phase. This paper considers the second phase.
An index is an efficient data structure for retrieving objects
given the value of one or more elements of those objects [4].
A good index scales gracefully to large numbers of keys and is
insensitive to the length or content of the inserted strings [2].
Processing data through queries can produce useful information,
provided that the answers to such queries reflect the stored
information directly and avoid searching unnecessary field
values.
The basic idea of a field index is to ignore unnecessary
fields while processing queries. The logical unit of an entry in
a field index is a tuple whose field order is decided by the
index designer. The tuple therefore directly affects the
performance of query execution.
Interest in processing queries over large data sets has
contributed to improvements in index structures, such as
projection strategy indices (RB+), the Value-List Index (B+ tree),
and more complex index structures that speed up query
evaluation [15]. Although these indices achieve efficient query
processing, querying huge data sets, especially streaming data,
still suffers from time overheads when answering queries that
involve multiple fields. The problem arises mainly in
traditional search trees because of their sequential access
(i.e., static order). With a static order index, the search
must traverse all values in each entry to find matching
values for the required field, which might be a non-leading
field. In that case, the search is nearly sequential.
Fig. 1 illustrates the problem of using a static order index to
answer queries over a multi-field index, i.e., a single index
that references multiple fields (e.g., State, City, Zipcode, Web
Site). According to Fig. 1, the index is not helpful when the
query is processed using the State and Zip fields, because
unnecessary fields must be searched to answer it. The only
field that benefits from the index structure is the leading
field, while the rest of the fields are nearly unsorted.
Fig 1. Single Multi-field Index
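The static-order problem can be made concrete with a minimal Python sketch. It assumes the index is simply a list of rows sorted in the fixed order (State, City, Zipcode); the rows, the zip codes, and the function names are hypothetical sample values chosen for illustration and are not data from the paper.

```python
from bisect import bisect_left

# Hypothetical rows stored in the static order (State, City, Zipcode).
# The zip codes are made-up sample values, not data from the paper.
ROWS = sorted([
    ("California", "Garden Grove", "92840"),
    ("California", "Los Angeles", "90001"),
    ("Nevada", "Reno", "89501"),
    ("Ohio", "Findlay", "45840"),
])


def find_by_state(state):
    """Leading field: two binary searches bound the matching range."""
    lo = bisect_left(ROWS, (state,))
    # state + "\0" is the smallest string strictly greater than state.
    hi = bisect_left(ROWS, (state + "\0",))
    return ROWS[lo:hi]


def find_by_zipcode(zipcode):
    """Non-leading field: the sort order does not help, so scan every row."""
    visited = 0
    hits = []
    for row in ROWS:
        visited += 1
        if row[2] == zipcode:
            hits.append(row)
    return hits, visited
```

A look-up on the leading field touches only a logarithmic number of rows, while a look-up on Zipcode visits every row, which is the near-sequential behaviour described above.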
This paper resolves the multi-field searching problem caused
by static ordering of a large data set by introducing a
dynamic ordered multi-field index (DOMI) that holds the values
of multiple fields and uses the radix tree as its basic
structure.
The main advantages of the radix tree relative to other trees
such as B/B+ trees are that it is efficient in terms of storage,
fast on look-up operations, and maintains the data in sorted
order [16] [9]. It also supports other operations such as prefix
lookup and row update.
The dynamic property of the proposed DOMI structure is
achieved by using a hash table together with radix trees. The
hash table allows a search to proceed directly to the root node
of a field, which helps a query access the data items in
parallel in reasonable time. In other words, the dynamic index
allows a search to proceed directly to the portion of the index
that can answer a query.
The remainder of this paper is organized as follows. Section
II discusses related work. Section III gives an overview of the
radix tree and its advantages over B/B+ trees. Section IV
presents how to build a multi-field index using radix trees.
Section V presents the proposed DOMI structure and how to
build an index in which the order of fields can change
dynamically according to the given query. Finally, Section VI
presents the conclusions of the paper.
II. RELATED WORK
More organizations run into problems with processing big
data every day [7]. In practice, data stores contain millions of
records describing real-world objects, and searching is the
most common operation used to retrieve data. Data indexing is
therefore required to improve retrieval performance [19].
Indexing huge amounts of digital information raises several
problems, and storing, indexing, and searching data sets have
gained increasing attention. The survey of spatial indexing in
[6] addresses the problem of designing efficient indexes to
support spatial objects. The reason for creating an index for a
data set is to speed up access to a subset of the data [21].
Index structures differ in terms of structure, query support,
data type support, and application [19]. Tuple reconstruction
is an important component of column stores and affects the
performance of query execution. It is therefore necessary to
perform tuple reconstruction before query execution by using
main indexes and joining address mapping indexes [8].
The radix tree has a key-prefix property that allows efficient
indexing in main memory. Query processing in QPPT keeps
index materialization costs low and uses optimal prefix trees,
which are known to be main-memory optimized, to balance
read and write operations [12].
On the other hand, the Adaptive Index Buffer reduces the
cost of table scans by quickly indexing tuples in memory
until the partial index is adapted to the workload again, but
it covers only a subset of the values of a column [14].
As a data structure for storing tuples of fields, the B-tree is
a simple existing method that permits storing vertical
partitions in traditional B-tree indexes with practically zero
overhead for storing the tuples [10]. Even so, a B-tree is
usually used to search tuples of fields in a static order.
Given a query, the search begins at the root and checks
each child sequentially until the queried value is found.
In this paper, a radix tree is used to store field values. It
follows the adaptive radix tree, which dynamically chooses
compact internal data structures to overcome the common
problem of worst-case space consumption and to enable
efficient parallel access [9]. The radix tree saves storage space
by exploiting the common prefixes in the string set.
The Index Fabric, which is based on the B-tree, uses a
segmented Patricia approach to allow a search to proceed
directly to a block-sized portion of the index that can answer a
query [2]. In the Index Fabric, the search proceeds from
segment to segment until the desired data segment is found.
Compared with the existing approaches, using a radix tree to
build indexes is very helpful, especially for querying
streaming data. Efficient search is the most important
criterion for selecting data structures because search is
normally carried out on-line (and thus needs a quick response)
and is carried out many times [20]. Traditionally, search uses
auxiliary data structures such as B-trees, hash indexes, and
bitmap indexes, which perform very well. Such indices provide
quick and easy access to data and save time and operations
when searching and inserting data.
This paper uses two different data structures for index
construction.
III. RADIX TREE (RT) OVERVIEW
Using a radix tree reduces the storage space required for
storing values. It also retrieves information very efficiently
[3]. For object storage, the radix tree uses a simple key/value
model. It also enables parallel access to sub-radix trees.
Generally, a radix tree is a hierarchical structure composed of
internal nodes and leaf nodes, where [3]:
• Internal nodes contain pairs of the form (key, P). An
entry in an internal node contains a pointer (P) to a
lower-level node in the sub-tree, and the key is the field
name. The structure of an inner node also allows its
values to contain more complex data types.
• Leaf nodes store the values corresponding to the keys.
The useful operations and properties of the radix tree are [13]:
• Look-up: determines whether a string exists in the tree.
• Insertion: either adds a new outgoing edge labeled with all
remaining elements of the input string, or finds the longest
common prefix, splits it into two edges, and then adds the
suffix.
• Ordering: the keys are sorted lexicographically, which
supports further operations (e.g., range scan, prefix
lookup, update).
With respect to the performance of operations, k-ary search
trees fail to support incremental update operations [17], and
B+ trees have expensive update operations [18]. Radix trees,
on the other hand, do not have such expensive update
operations because they need minimal restructuring of nodes
compared to B+ trees, which need a time-consuming insertion
algorithm.
One advantage of the radix tree is that it reveals early
whether there are possible matches. In other search trees such
as the binary tree, the decision is slower because the search
has to descend through levels of tree nodes and the result of
the comparisons cannot be predicted easily [9]. In summary,
the reasons for using a radix tree are that it provides fast
look-ups, efficient insertions and updates, and, because the
data is sorted, range scans and prefix look-ups.
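The look-up, insertion, and ordering properties above can be sketched in Python as follows. This is a minimal illustration of a generic radix tree, not the authors' implementation; the class and function names are assumptions, and the sample values are taken from the paper's later examples with the capitalisation of “A White Rose” normalised so that both company names share the prefix “A W”.

```python
class RadixNode:
    def __init__(self):
        self.children = {}    # edge label (non-empty str) -> RadixNode
        self.is_key = False   # True if the path to this node spells a stored key


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def insert(node, key):
    if key == "":
        node.is_key = True
        return
    for label in list(node.children):
        n = common_prefix_len(label, key)
        if n == 0:
            continue
        child = node.children[label]
        if n < len(label):
            # Split the edge at the longest common prefix.
            mid = RadixNode()
            mid.children[label[n:]] = child
            del node.children[label]
            node.children[label[:n]] = mid
            child = mid
        insert(child, key[n:])
        return
    # No edge shares a prefix with the key: add a new outgoing edge.
    leaf = RadixNode()
    leaf.is_key = True
    node.children[key] = leaf


def contains(node, key):
    if key == "":
        return node.is_key
    for label, child in node.children.items():
        if key.startswith(label):
            return contains(child, key[len(label):])
    return False


def keys(node, prefix=""):
    """Yield all stored keys in lexicographic order."""
    if node.is_key:
        yield prefix
    for label in sorted(node.children):
        yield from keys(node.children[label], prefix + label)


root = RadixNode()
for name in ["Nevada", "A World Link", "A White Rose", "California"]:
    insert(root, name)
```

After these insertions the root has three outgoing edges (“A W”, “California”, “Nevada”), and the shared prefix “A W” is stored only once, with the suffixes “hite Rose” and “orld Link” below it.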
IV. BUILDING INDEX FIELD USING RADIX TREE (RT)
This section discusses how radix trees are used to build
convenient indexes for storing data and answering queries.
Queries can be processed efficiently with specific indexes.
Internal nodes are used as an index to insert and locate data
in the radix tree in minimum time.
There are two types of nodes:
• The Root Node (RN), such as (FN1, FN2, …, FNk), where
each field name FNi belongs to the set of field headers in the
original data. Each RN is used as the root of a Sub-Radix
Tree (SRT). Each internal node stores one element only.
• The Data Node (DN), which holds a value V and can be
described as a tuple (V, {offset}). V takes one of these forms:
• v is a complete value that exists in the original data file.
• pv is a common prefix shared by two values v. Storing it
once saves space for two inner nodes by truncating the
path to the leaf.
• sv is a suffix value that completes the prefix value of the
parent node.
An entry of a Data Node consists of a pointer to the data and
an offset, or set of offsets, that gives the item's location in the
original file.
Fig. 2 shows the structure of both the Root Node and the Data
Node in the radix tree. In this paper, the radix tree is built over
a set of qualifying fields. A candidate index for the data set is
chosen according to certain criteria, the relationships between
the fields themselves, and the optimal number of additional
fields. Fields must be pre-processed before index construction;
one possible pre-processing step is organizing the values of
each record in the form of tuples.
Fig 2. Single node of tree
Definition 1:
An RN is defined as follows:
RN = (FN, {p1, p2, …, pi}), where FN indicates a field name in
the original data and each pi points to a node of the other type
(a DN) in the tree.
Definition 2:
A DN is defined using one of three forms:
• DN = (v, {o1, o2, …, oi}, {p1, p2, …, pi}), where v is a complete
value from the domain of the values to be indexed, each o is
an offset or position of this value in the original data set, and
each p points to a node that can be a DN in the same SRT or
an RN in a new SRT.
• DN = (pv, {p1, p2, …, pi}), where pv is a common prefix of
two values v, and each p points to a node that contains a
suffix value.
• DN = (sv, {o1, o2, …, oi}, {p1, p2, …, pi}), where sv is a suffix
value that completes the prefix value of the parent node, and
each p points to a node that can be a DN in the same SRT or
an RN in a new SRT.
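Definitions 1 and 2 can be mirrored by two small Python record types. This is only a sketch under stated assumptions: the class names, the attribute names, the helper `stored_values`, and the offsets 3 and 7 are hypothetical, and a single `DataNode` type is used for the v, pv, and sv forms (a pv node is simply a node with no offsets).

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DataNode:
    value: str                                        # v, pv, or sv
    offsets: List[int] = field(default_factory=list)  # empty for a pv node
    pointers: List[Union["DataNode", "RootNode"]] = field(default_factory=list)


@dataclass
class RootNode:
    field_name: str                                   # FN
    pointers: List[DataNode] = field(default_factory=list)


def stored_values(node, prefix=""):
    """Yield (full value, offsets) for every value stored in one SRT."""
    for child in node.pointers:
        if isinstance(child, RootNode):
            continue  # root of an SRT in the next segment, outside this SRT
        text = prefix + child.value
        if child.offsets:
            yield text, child.offsets
        yield from stored_values(child, text)


# One SRT: a pv node "A W" with two sv children (offsets are made up).
company = RootNode("Company Name", [
    DataNode("A W", pointers=[
        DataNode("hite Rose", offsets=[3]),
        DataNode("orld Link", offsets=[7]),
    ]),
])
```

Walking the SRT with `stored_values(company)` rebuilds the complete values from the prefix and suffix nodes, which is how keys stay implicit in the paths.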
The values of each field are grouped into a segment that
may contain one or more Sub-Radix Trees (SRTs). The order
of segments is initially decided according to the relationships
between the fields. Fig. 3 illustrates the segments {1, 2, …, n},
numbered from the highest to the lowest. Each segment
contains one or more block-sized sub-radix trees (SRTi), as
shown in Fig. 4. The Root Node (RN) of an SRT stores a key
such as FN1, and each of its Data Nodes (DN) refers to a value
that belongs to FN1.
Fig 3. Segments in Radix Tree
A value such as v1FN1 is inserted in one of two ways. The
first is to add a new node that stores the complete value, as
shown in Fig. 3. The second is to find the longest common
prefix, such as pv2FN1, and then add further nodes to hold the
remaining suffix values. Storing a common prefix value only
once saves storage space. In this case, the different suffixes
are stored in separate nodes (sv2.1FN1, sv2.2FN1), and so on
for each value.
The Radix Tree (RT) index stores values (v1FN1, sv2.2FN1)
together with the position (offset) of each value, which refers
to the location of the actual spatial data it represents. Multiple
values can reference the same offset.
The search compares each tuple generated from a user's
query, which arrives in the form of a tuple query (field, value,
operator), where the operator can be equal to, greater than, and
so on, with each tuple of the original data set, which has the
form of a tuple data-set row (field, value, data type).
The search descends from the root at the highest segment and
either proceeds through nodes within the same segment or
moves to the next segment. The result of the search in one
segment is a pointer to data if the search key matches the data
key, a pointer to another segment, or null. Thus a query may
need to pass through more than one segment to find its
answer.
More than one SRT under a visited node may need to be
searched; hence it might not be possible to guarantee good
performance. A dynamic index offers a good way to avoid a
sequential search over the whole tree. The dynamic index is
based on the radix tree, which helps to retrieve data quickly
according to its location while visiting only a small number of
nodes.
Fig. 4 shows a single segment in which sub-radix trees (SRTs)
are grouped together. Each SRT has a root node that stores the
field name.
Fig 4. Sub Trees (ST) in one segment
Definition 3:
An SRT is defined as follows:
SRT = (N, E, n0) is an acyclic graph where:
• N is a set of nodes {n0, …, nk}, where k > 0 and n0 ∈ N,
• E is the set of links {e0, …, em}, where m ≥ 0 and each link eij is a
pair (ni, nj) such that ni ∈ N, nj ∈ N, and ni ≠ nj,
• n0 is the only RN in a single SRT; every ni ∈ N \ {n0}
must be of type DN.
V. DYNAMIC ORDER MULTI-FIELD INDEX (DOMI)
To handle a data set efficiently and to optimize the indexing
of a data set repository, a dynamic index structure should be
used. This is achieved by building the dynamic index on the
radix tree as its basic structure, which helps to retrieve the
data quickly. This paper introduces the Dynamic Order
Multi-field Index (DOMI) as a new index structure that
supports efficient query processing. DOMI reduces both the
search time and the storage overhead.
A dynamic index structure can be constructed from two
different data structures: the SRT and the Hash Table (HT).
The hash table allows each SRT to be accessed randomly and
independently, and it locates data in expected constant time.
Each root node of an SRTi is saved in the HT. The HT consists
of entries of the form (key, value), where the key is a field
name and the value is a list of pointers, each of which points to
the position of a root node in a radix tree.
The proposed data structure, the dynamic ordered multi-field
index (DOMI), finds the data-set row tuples in the original data
source that match the query tuple by accessing any field
directly. Every search starts by consulting the hash table and
locating the pointers to the required RNs using the field names.
Fig 5. Hash Table (HT)
Fig. 5 illustrates the design of the dynamic ordered multi-field
index (DOMI) for querying data sets. Each key in the HT has
one or more pointers to root nodes of SRTs. If a segment
contains more than one SRT, as segment 2 does in Fig. 4, its
key has a list of pointers, FN2 → {P1, P2, …, Pn}. These
pointers refer directly to the root nodes of the segment that
holds the values needed to answer the queries.
Definition 4:
An HT is defined as follows:
HT = ({FN1, FN2, …, FNn}, {P1, P2, …, Pn}), where
(1) FN is the set of field names held by the RNs of each segment,
(2) P is the set of pointers for each RN.
When an input query q is given in the form of a tuple T = (FN,
value) and the FN of the query matches an FN in the HT, the HT
returns the answer to q according to {FNk, Pn ⊆ SRTi, SRTi ∈ RT}.
The search process starts by comparing the field in the query
with the field names in the hash table. If one matches, the
search follows the link that connects the HT to a particular
SRTi, where the desired data is found. If none matches, the
value does not exist and the search terminates.
When a query tuple is received, it goes immediately to the HT
to determine the suitable root in the right segment, and the
query is then processed accordingly.
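The role of the HT described in this section can be sketched with an ordinary Python dictionary that maps each field name to the list of root nodes carrying that name. The class names and methods below are assumptions made for illustration; the paper does not prescribe an API.

```python
class RootNode:
    """Root of one sub-radix tree (SRT); stores only the field name here."""

    def __init__(self, field_name):
        self.field_name = field_name
        self.children = {}


class HashTable:
    """Maps a field name FN to the pointers {P1, ..., Pn} of its root nodes."""

    def __init__(self):
        self._roots = {}  # field name -> list of RootNode

    def register(self, root):
        # Called whenever a new RN is created in any segment.
        self._roots.setdefault(root.field_name, []).append(root)

    def roots(self, field_name):
        # None means the field is not indexed, so the search terminates.
        return self._roots.get(field_name)


ht = HashTable()
ht.register(RootNode("ST"))
ht.register(RootNode("City"))   # SRT under one state value
ht.register(RootNode("City"))   # SRT under another state value
```

A query on 'City' obtains both root pointers with one dictionary access and never visits the 'ST' segment.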
A. Preliminaries
A binary search tree of height h supports the basic
dynamic-set operations SEARCH, PREDECESSOR,
SUCCESSOR, MINIMUM, MAXIMUM, INSERT, and
DELETE in O(h) time. These operations are fast if the height
of the search tree is small. If the height is large, they may be
no faster than on a linked list [23].
Hash tables, on the other hand, support the dictionary
operations INSERT, DELETE, and SEARCH. In the worst
case, a SEARCH in a hash table requires O(n) time, but the
expected time for hash-table operations is O(1) [23].
For a radix tree, the worst-case complexity of look-up, insert,
and delete operations is O(l), where l is the maximum length of
a string in the set [24]. Table 1 lists the worst-case time
complexity of operations (e.g., insert and look-up) for
different data structures, where n is the number of elements
and l is the maximum length of the new key.
Table 1. Comparison between Index Structures
Index Structure | Time complexity
B-tree          | O(log n)
B+-tree         | O(log n)
R-tree          | Does not utilize space efficiently and has no worst-case time complexity guarantee [19]
Radix Tree      | O(l)
Hash Table      | O(1) (expected)
Most index structures have a time complexity of O(log n),
but they differ in their constant factors, terms, and the
conditions under which they are used to develop algorithms
[19]. Radix trees, on the other hand, have a number of
interesting properties that distinguish them from other search
trees [9]:
• The height (and complexity) of radix trees depends on the
length of the keys, not on the number of elements in the
tree.
• Radix trees do not need rebalancing operations, and all
insertion orders result in the same tree.
• The keys are stored in lexicographic order.
• The path to a Data Node represents the key of that
leaf. Therefore, keys are stored implicitly and can be
reconstructed from paths.
B. Insertion Algorithm
Fig. 6 presents the pseudocode of a simple insert algorithm that
inserts the values of TData set_Row tuples, in (key/value) form,
into the segments of the radix tree.
Fig 6. Simple Algorithm to insert (RN&DN)
C. Insertion Example
The insertion of values is explained for two cases. Fig. 7
illustrates segment 1, which begins with the key ‘ST’, followed
by the value ‘Nevada’ on the right side of the root ST.
A new entry ‘ST=California’ is inserted into the upper
segment. The search for the new value ends at a null pointer
leaving a non-leaf node. Next, a node is created for the value
“California” and inserted as a Data Node (child) of the root
node “ST” (line 2 of the algorithm). The DN of this value can
simply be inserted under the existing root node, which now
has a new child. A new entry of another tuple can be inserted
into SRT1 at any time; in this case, no new pointer for the RN
‘ST’ is added to the hash table. If an existing DN in segment 1
contains a prefix value pv, it is compared with the new value
of the entry. The node can then be split, yielding an sv for each
of the two values (line 3 of the algorithm). The symbols (…..)
in a node represent the offsets that are inserted together with
the values.
Fig 7. Inserted new Data Node (case 1)
Note that a middle segment contains more than one SRT.
A new value can be inserted into a middle segment, where each
value is inserted according to the SRT above it in the upper
segments. Segment 2 contains more than one RN ‘City’. To
insert a new value, the insertion must follow the path of the
SRT in the segment above. Inserting a new entry
‘City=Los Angeles’ appends a new DN to the existing RN2 of
segment 2: a node is created for the value “Los Angeles” and
added as a child DN of the RN “City”. Since no new RN is
added to the index, nothing needs to be added to the HT. A
new entry “Company Name=A World Link” descends until a
leaf node is reached, but the stored value is not the same as the
new value, so a new path has to be generated from the
common prefix. The old value “A White Rose” and the new
value “A World Link” share the prefix “A W”, which branches
out to two different leaves, each of which contains the suffix
of one value.
Note that some segments of the DOMI did not require the
creation of RNs, because the insertion was carried out on
existing (previously inserted) RNs. In this case, there is no
need to add new pointers for RNs to the hash table, which
remains unchanged. Therefore, the insertion operation can be
performed in O(|V|), where |V| is the length of the value to be
inserted.
Algorithm (Parameters V: pair of (v, {offset1, …, offseti}),
SRTi: sub radix tree, pv: prefix value, vi,j: different
values, HT: hash table, pi: new pointer)
1) IF N is a Root Node:
2) For each value V of TData set_Row, expand a
new Leaf Node into SRTi and add V, or
3) IF a Data Node contains pvj = pvi, split the node
and insert sv for each v.
4) IF N equals null:
5) Add a new Root Node inside the segment, then call
steps (2 or 3).
6) Add the new pi of the root node into the HT.
Fig. 8 illustrates insertion in case 2. A new DN is created to
insert the value ‘Ohio’ into segment 1.
Fig 8. Inserted new Root Node (case 2)
In segment 2, an RN ‘City’ must be inserted first, and then the
value ‘Findlay’ can be inserted as a child node. Since no RN
exists under the DN ‘Ohio’, a new RN ‘City’ is created (see
line 5) as an extension of the value directly above it, and then
a new DN is created to insert the value ‘Findlay’. The same
steps are applied in segment 3. Segment 2 and segment 3 of
the DOMI therefore require the creation of RNs, which means
new pointers for the RNs (‘City’, ‘Company name’) must be
added to the hash table (see line 6).
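The two insertion cases can be traced with a short Python sketch. It is a simplification under stated assumptions, not the authors' algorithm: each DN stores a complete value, so the prefix split of line 3 is omitted, the HT is a plain dictionary, and the names `Node`, `insert_row`, and the company “Example Co” are hypothetical.

```python
class Node:
    def __init__(self, label):
        self.label = label     # field name for an RN, value for a DN
        self.offsets = []      # used by DNs only
        self.children = {}     # child label -> Node


def insert_row(top, ht, row, offset):
    """Insert one row given as ordered (field, value) pairs, top segment first."""
    parent = top
    for field_name, value in row:
        rn = parent.children.get(field_name)
        if rn is None:
            # Case 2 (lines 4-6): create the RN and record its pointer in the HT.
            rn = Node(field_name)
            parent.children[field_name] = rn
            ht.setdefault(field_name, []).append(rn)
        dn = rn.children.get(value)
        if dn is None:
            # Case 1 (line 2): add a new DN under an existing RN.
            dn = Node(value)
            rn.children[value] = dn
        dn.offsets.append(offset)
        parent = dn            # the next segment hangs below this DN


top = Node("<index>")          # hypothetical holder above segment 1
ht = {}
rows = [
    [("ST", "California"), ("City", "Garden Grove"), ("Company Name", "A White Rose")],
    [("ST", "California"), ("City", "Garden Grove"), ("Company Name", "A World Link")],
    [("ST", "Ohio"), ("City", "Findlay"), ("Company Name", "Example Co")],
]
for offset, row in enumerate(rows):
    insert_row(top, ht, row, offset)
```

The second row reuses every RN and leaves the HT unchanged, while the 'Ohio' row creates new 'City' and 'Company Name' RNs and so adds one pointer to each of those HT entries.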
The time complexity of the insertion operation in DOMI
depends on the time complexity of the Multi-Segments Radix
Tree Index (MSRTI) and the Hash Table (HT). An insertion
must insert the new value into the appropriate DN and also
insert the RN for this value into the HT.
Therefore, the time complexity can be determined as
follows:
l = max(|V1|, |V2|, …, |Vn|), where l is the length of the
longest value.
MSRTI: O(n l), where n is the number of segments, which
equals the number of fields, and l is the length of the
longest value.
DOMI: O(n l) + O(1) = O(n l), since O(1) is negligible.
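The bound can be written out step by step. The sketch below assumes, as stated earlier for a single value, that inserting Vi into its segment costs O(|Vi|), and it introduces the symbols T_MSRTI, T_HT, and T_DOMI only for this illustration.

```latex
\begin{aligned}
l &= \max(|V_1|, |V_2|, \ldots, |V_n|),\\
T_{\mathrm{MSRTI}} &= \sum_{i=1}^{n} O(|V_i|) \;\le\; n \cdot O(l) \;=\; O(n\,l),\\
T_{\mathrm{DOMI}} &= T_{\mathrm{MSRTI}} + T_{\mathrm{HT}} \;=\; O(n\,l) + O(1) \;=\; O(n\,l).
\end{aligned}
```

Even if a row creates a new RN in every segment, the at most n expected constant-time HT updates add O(n), which is still absorbed by O(n l).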
D. Search Algorithm
Fig. 9 presents the algorithm for searching values in the DOMI
data structure.
Fig 9. Simple Algorithm for search in DOMI
Usually, a tree search must descend from the highest tree, so
more than one SRT might be traversed.
E. Search Example
Fig. 10 describes how to search the DOMI for a specific value
stated in a given query. Given a query, the search begins from
the RNs that are stored in the HT. It checks the RNs until it
finds a field name that matches the query field name. If there
is a match, the search follows the direct link that refers to a
particular block-sized SRTi. The search then continues from
the RN of SRTi down to a DN.
Line 1 of Fig. 9 compares a key of Tquery with an RN of the
HT. The following example illustrates how queries are
processed using the DOMI structure.
A query Q is stated as “City = Garden Grove and Company
name = A White Rose”. It goes immediately to the HT. If
‘City’ and ‘Company name’ of Tquery match the
corresponding FNs in the HT, the search process follows the
pointers of those FNs. P1 among the pointers of ‘City’ points
directly to segment 2, which includes the value “Garden
Grove” required by the query, without traversing segment 1.
P1 among the pointers of ‘Company name’ points directly to
segment 3 without traversing segment 1 and segment 2
sequentially. Segment 3 includes the node that stores the value
“A White Rose”. The search then continues from that node
down to a DN to reach the desired data. The leftmost DN of
segment 3 represents the common prefix of the two values “A
White Rose” and “A World Link”.
Algorithm (Fig. 9):
Input: Q = (k1, v1, k2, v2), the tuples of the query.
Output: all occurrences of Q in the data set.
RN: root node, p: pointer of RN
/*Search begins at the hash table (HT)*/
1) Check whether a ki of Tquery matches an RNi of the HT. If so:
2) The pi of that RNi in the HT leads to SRTi.
3) Return output.
4) Otherwise, if ki of Tquery ≠ any RNi of the HT, return null.
The leaf DN contains a pair consisting of a value and the
position (offset) of that value, which refers to the location of
the actual spatial data within the input stream. If no FN
matches the field of the search key, the key does not exist and
the search terminates.
Fig 10. Search in DOMI
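The search walk-through can be sketched in Python as follows. The sketch makes several assumptions that are not in the paper: each SRT is flattened to a dictionary from complete values to offset sets instead of a radix tree, the terms of a query joined by “and” are combined by intersecting their offsets, and the names `SRT` and `search`, the offsets, and “Example Co” are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class SRT:
    field_name: str
    # Assumption: a flat value -> offsets map stands in for the radix tree.
    values: Dict[str, Set[int]] = field(default_factory=dict)


def search(ht: Dict[str, List[SRT]], query: List[Tuple[str, str]]) -> Set[int]:
    """Return the offsets of rows matching every (field, value) term."""
    result = None
    for field_name, value in query:
        roots = ht.get(field_name)       # line 1: consult the HT first
        if roots is None:
            return set()                 # line 4: no such field, return null
        hits: Set[int] = set()
        for srt in roots:                # line 2: follow each pointer to an SRT
            hits |= srt.values.get(value, set())
        result = hits if result is None else result & hits
    return result or set()               # line 3: return output


ht = {
    "ST": [SRT("ST", {"California": {0, 1}, "Ohio": {2}})],
    "City": [SRT("City", {"Garden Grove": {0, 1}}), SRT("City", {"Findlay": {2}})],
    "Company Name": [
        SRT("Company Name", {"A White Rose": {0}, "A World Link": {1}}),
        SRT("Company Name", {"Example Co": {2}}),
    ],
}
```

Calling `search(ht, [("City", "Garden Grove"), ("Company Name", "A White Rose")])` reaches segments 2 and 3 through the HT alone, without touching the 'ST' entry.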
VI. CONCLUSIONS
Big data environments need more efficient index structures to
speed up the evaluation of queries. This paper has introduced
a new index structure, the Dynamic Ordered Multi-Field Index
(DOMI). DOMI is based on a collection of radix trees together
with a single hash table. The hash table allows random access
to any sub-radix tree without traversing the trees in the upper
segments. In addition, the use of radix trees decreases space
consumption by storing common prefix values only once, and
it provides efficient time complexity for insertion and search
operations. For these reasons, we believe that the proposed
DOMI offers an attractive alternative to other structures for
indexing ever-growing big data.
REFERENCES
[1] Jeffrey Dean and Sanjay Ghemawat, “MapReduce:
Simplified Data Processing on Large Clusters”, Google, Inc.,
2004.
[2] Brian F. Cooper, Neal Sample, Michael J. Franklin, Gísli
R. Hjaltason, Moshe Shadmon, “A Fast Index for
Semistructured Data”, Proceedings of the 27th VLDB
Conference, Roma, Italy, 2001.
[3] Christophe Cérin, Michel Koskas, Jean-Sébastien Gay, Gaël
Le Mahec, “Efficient Data-Structures and Parallel
Algorithms for Association Rules Discovery”, Proceedings
of Fifth Mexican International Conference, in IEEE, 2004.
[4] Anand Rajaraman, Jure Leskovec, Jeffrey D. Ullman,
“Mining of Massive Datasets”, 2012.
[5] Andrew S. Tanenbaum, Maarten Van Steen, “Distributed
Systems: Principles and Paradigms”, 2007.
[6] V. Gaede and O. Günther, “Multidimensional Access
Methods,” ACM Computing Surveys, vol. 30, no. 2, pp.
170-231, June 1998.
[7] Kevin McGowan, “Big data, Fast Processing Speeds”, In
SAS Solutions on Demand, Cary NC, 2013.
[8] Xiangwu Ding, Wenbing Yu, Jiajin Le, “An Adaptive
Projection Strategy and Its Implementation in Column
Stores”, in IEEE, 2011.
[9] Viktor Leis, Alfons Kemper, Thomas Neumann, “The
Adaptive Radix Tree: ARTful Indexing for Main-
Memory Databases”, ICDE, 2013.
[10] Goetz Graefe, “Efficient columnar storage in B-trees”, In
ACM, 2007.
[11] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei
Han, Bhavani Thuraisingham, “A Multi-partition Multi-chunk
Ensemble Technique to Classify Concept-Drifting
Data Streams”, In Springer-Verlag Berlin Heidelberg,
2009.
[12] K. Ramamohanarao, John W. Lloyd, “Dynamic Hashing
Schemes”, In ACM Computing Surveys, 1998.
[13] Per-Ake Larson,” Linear hashing with separators—a
dynamic hashing scheme achieving one-access”, In
ACM Transactions on Database Systems, 1988.
[14] Hannes Voigt, Tobias Jaekel, Thomas Kissinger,
Wolfgang Lehner, “Adaptive Index Buffer”, In 28th
International Conference on Data Engineering
Workshops, In IEEE, 2012.
[15] P. O’Neil, D. Quass, “Improved Query Performance with
Variant Indexes” In ACM SIGMOD international
conference on Management of data, page 38--49, 1997.
[16] J. Corbet, “Trees I: Radix trees,”
http://lwn.net/Articles/175432.
[17] B. Schlegel, R. Gemulla, W. Lehner, “k-ary search on
modern processors,” In DaMoN workshop, 2009.
[18] R. Bayer and E. McCreight, “Organization and
maintenance of large ordered indices,” in SIGFIDET,
1970.
[19] P. Patel, D. Garg, “Comparison of Advance Tree Data
Structures”, in IJCA International Journal of Computer,
2012.
[20] Guojun Lu, “Techniques and Data Structures for
Efficient Multimedia Retrieval Based on Similarity”, In
IEEE, 2002.
[21] Lisa A. Horwitz, “Techniques for Managing Large Data
Sets: Compression, Indexing and Summarization”,
Applications, 2012.
[22] Ajit Singh, Dr. Deepak Garg, “Implementation and
Performance Analysis of Exponential Tree Sorting”,
International Journal of Computer Applications, pp. 34-
38, June 2011.
[23] Thomas H. Cormen, Charles E. Leiserson, Ronald L.
Rivest, Clifford Stein, “Introduction to Algorithms”,
Third Edition, 2009.

  • 1. Information Retrieval using Dynamic Indexing
Sura I. Mohammed, Computer science department, Faculty of computer science, Cairo University, Egypt, Cairo, suib200684@yahoo.com
Hussien M. Sharaf, ITC department, Arab Open University, Egypt, Cairo, hussiensharaf@from-masr.com
Fatma A. Omara, Computer science department, Faculty of computer science, Cairo University, Egypt, Cairo, f.omara@fci-cu.edu.eg
Abstract—As the demand for information retrieval grows quickly, indexing structures have become important for supporting fast retrieval. This paper introduces a new data structure for information retrieval called the Dynamic Ordered Multi-field Index (DOMI). It is based on radix trees organized in segments, together with a hash table that points to the roots of each segment; each segment is dedicated to storing the values of a single field. The hash table is used to access the needed segments directly, without traversing the upper segments, so DOMI improves look-up performance for queries that address a single field. For queries that address multiple fields, each relevant segment of the radix tree is traversed sequentially without visiting unrelated branches. Segmentation also gives DOMI flexibility for minimizing communication overhead in a distributed system: every field in the radix tree is represented by one segment, and each segment can be stored as one block. In addition, the proposed DOMI consumes less space than indexes built using B or B+ trees. Hence, it is more suitable for data-intensive workloads such as Big Data.
Keywords: Dynamic Order Multi-field Index (DOMI), Data query, indexing structures, Big Data.
I. INTRODUCTION
As data sets grow larger, research has been carried out to provide flexible and efficient query mechanisms for extracting data from a set of fields in streaming data.
Several existing data structures, such as B and B+ trees, can be used for this task. In particular, index mechanisms are needed to retrieve information quickly according to its location within large data sets; the main objective of indexing is to optimize query speed [22]. There are two major phases in supporting dynamic queries over a large data set. The first phase prepares the data sets. The second phase answers a query using an index over the data that results from the first phase. This paper considers the second phase. An index is an efficient data structure for retrieving objects given the value of one or more elements of those objects [4]. Such an index scales gracefully to large numbers of keys and is insensitive to the length or content of inserted strings [2].
Data processing based on queries can produce useful information, provided that the answers reflect the stored information directly and avoid searching unnecessary field values. The basic idea of a field index is to ignore unnecessary fields while processing queries. The logical unit of an entry in a field index is a tuple whose order is decided by the index designer; the tuple therefore directly affects the performance of query execution. Interest in processing queries over large data sets has contributed to improvements in index structures, such as projection strategy indices (RB+), the Value-List Index (B+ tree), and more complex index structures that speed up query evaluation [15]. Although these indices achieve efficient query processing, querying huge data sets, especially streaming data, suffers from time overheads when answering queries that involve multiple fields. The problem occurs mainly in traditional search trees because of their sequential access (i.e., static order).
With a static-order index, the search must traverse all values in each entry to find matching values for the required field, which might be a non-leading field. In that case, the search is nearly sequential. Fig. 1 illustrates the problem of using a static-order index to answer queries over a multi-field index, i.e., a single index that references multiple fields (e.g., State, City, Zipcode, Web Site). According to Fig. 1, the index is not helpful when the query is processed using the State and Zip fields, because unnecessary fields are searched to answer the query. The only field that benefits from the index structure is the leading field, while the rest of the fields are nearly unsorted.
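The leading-field limitation can be illustrated with a minimal sketch. It assumes a static-order multi-field index modeled as a sorted list of tuples; the rows, the `.example` web sites and the helper names `find_by_state` and `find_by_zip` are hypothetical and are not taken from the paper.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows (state, city, zipcode, website); the values are made up.
rows = [
    ("Nevada", "Reno", "89501", "c.example"),
    ("California", "Los Angeles", "90001", "a.example"),
    ("Ohio", "Findlay", "45840", "d.example"),
    ("California", "Garden Grove", "92840", "b.example"),
]

# Static-order multi-field index: entries are sorted by the whole tuple,
# so only the leading field (state) ends up globally ordered.
index = sorted(rows)
state_keys = [row[0] for row in index]  # leading-field column, already sorted


def find_by_state(state):
    """Leading field: two binary searches delimit the matching run."""
    lo = bisect_left(state_keys, state)
    hi = bisect_right(state_keys, state)
    return index[lo:hi]


def find_by_zip(zipcode):
    """Non-leading field: zip codes are not ordered, so scan every entry."""
    return [row for row in index if row[2] == zipcode]


print(find_by_state("California"))
print(find_by_zip("45840"))
print([row[2] for row in index])  # zip column in index order, not sorted
```

Only the query on the leading field can use binary search; the query on the zip code falls back to a scan of all entries, which is the behavior the paragraph above describes.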
  • 2. Fig 1. Single Multi-field Index
According to the work in this paper, the multi-field search problem caused by static ordering of a large data set is resolved by introducing a dynamic ordered index (DOMI) that holds the values of multiple fields and uses the radix tree as its basic structure. The main advantages of the radix tree relative to other trees such as B/B+ trees are that it is efficient in terms of storage, fast on look-up operations, and maintains the data in sorted order [16] [9]. It also supports other operations such as prefix lookup and row update. The dynamic property of the proposed DOMI structure is achieved by using a hash table together with radix trees. The hash table allows a search to proceed directly to the root of a field, which helps a query access data items in parallel in reasonable time. In other words, the dynamic index allows a search to proceed directly to the target portion of the index that can answer a query.
The remainder of this paper is organized as follows. Section II discusses related work. Section III gives some general observations on the advantages of radix trees over B/B+ trees and on how to build an index of fields using a radix tree. Section IV presents how to build a multi-field index using a radix tree. Section V presents the proposed DOMI structure and how to build an index in which the order of fields can change dynamically according to the given query. Finally, Section VI presents the conclusions of the paper.
II. RELATED WORK
More organizations run into problems with processing big data every day [7]. In practice, data stores contain millions of records for real-world objects, and searching is the most common operation used to retrieve data. Data indexing is therefore required to improve retrieval performance [19]. Indexing huge amounts of digital information is difficult, and the problem of storing, indexing and searching data sets has gained increasing attention.
A survey of spatial indexing is given in [6], which addresses the problem of designing efficient indexes to support spatial objects. The reason for creating an index for a data set is to speed up access to a subset of the data [21]. Index structures differ in terms of structure, query support, data type support and application [19]. Tuple reconstruction is an important component in column stores and affects the performance of query execution. Therefore, tuple reconstruction must be performed before query execution, using main indexes and joint address-mapping indexes [8]. The radix tree has a key-prefix property that allows efficient indexing in main memory. Query processing in QPPT keeps index materialization costs low and uses optimal prefix trees, which are known to be main-memory optimized, to achieve balanced read/write operations [12]. The Adaptive Index Buffer reduces the cost of table scans by quickly indexing tuples in memory until the partial index is adapted to the workload again, but it covers only a subset of the values of a column [14]. For storing tuples of fields, the B-tree is a simple existing method that permits storing vertical partitions in traditional B-tree indexes with practically zero overhead for storing the tuples [10]. Even so, a B-tree is usually used to search tuples of fields in a static order: given a query, the search begins from the root and checks each child sequentially until the query is found. In this paper, a radix tree is used to store field values. It follows the adaptive radix tree in dynamically choosing compact internal data structures, which overcomes the common problem of worst-case space consumption and enables efficient parallel access [9].
The radix tree saves storage space by exploiting the common prefixes in the string set. The Index Fabric, based on a B-tree, uses a segmented Patricia approach to allow a search to proceed directly to a block-sized portion of the index that can answer a query [2]. In the Index Fabric, the search proceeds from segment to segment until the desired data segment is found. Compared to existing approaches, using a radix tree to build indexes is very helpful, especially for querying streaming data. Efficient search is considered the most important criterion for selecting data structures because search is normally carried out on-line (and thus needs a quick response) and is carried out many times [20]. Traditionally, search relies on auxiliary data structures such as B-trees, hash indexes, and bitmap indexes. These data structures have excellent performance. Indices are used to provide quick and easy access to data and to save time and operations when searching and inserting data.
  • 3. This paper uses two different data structures for index construction.
III. RADIX TREE (RT) OVERVIEW
Using a radix tree reduces the storage space required for storing values. It also retrieves information efficiently [3]. For object storage, the radix tree uses a simple key/value model that follows from the characteristics of radix trees. It also enables parallel access to sub-radix trees. Generally, a radix tree is a hierarchical structure composed of internal nodes and leaf nodes, where [3]:
• Internal nodes contain pairs of the form (key, P). An entry in an internal node contains a pointer (P) to a lower-level node in the sub-tree, and the key is the field name. The structure of an inner node gives the tree the capacity to hold values of more complex data types.
• Leaf nodes store the values corresponding to the keys.
The useful properties of the radix tree are [13]:
• Look-up: determines whether a string exists in the tree.
• Insertion: either adds a new outgoing edge labeled with all remaining elements of the input string, or finds the longest common prefix, splits the edge into two edges, and then adds the suffix.
• Ordering: keys are sorted lexicographically, which supports further operations (e.g., range scan, prefix lookup, update).
With respect to the performance of operations, k-ary search trees fail to support incremental update operations [17], and B+ trees have expensive update operations [18]. Radix trees, on the other hand, do not have such expensive update operations because they need minimal restructuring of nodes, whereas B+ trees need a time-consuming insertion algorithm. One advantage of the radix tree is that it detects early whether there are possible matches. In other search trees, such as a binary tree, this decision will probably be slower because the search has to go through levels of tree nodes and the result of the comparisons cannot be predicted easily [9].
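The look-up, insertion and ordering properties listed above can be sketched as a small radix tree over plain strings. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `RadixNode`, `insert`, `contains` and `iter_keys` are hypothetical, edges are kept in a Python dict, and the sample value "A White Rose" is capitalized here so that it shares the prefix "A W" with "A World Link".

```python
import os


class RadixNode:
    def __init__(self):
        self.children = {}   # edge label (str) -> RadixNode
        self.is_key = False  # True if a stored string ends at this node


def insert(root, key):
    """Insert key, splitting an edge at the longest common prefix if needed."""
    node = root
    while key:
        for label, child in list(node.children.items()):
            common = os.path.commonprefix([label, key])
            if not common:
                continue
            if len(common) < len(label):
                # Split the edge: keep the shared prefix once and
                # push the old suffix one level down.
                mid = RadixNode()
                mid.children[label[len(common):]] = child
                del node.children[label]
                node.children[common] = mid
                child = mid
            node, key = child, key[len(common):]
            break
        else:
            # No edge shares a prefix: add one edge with the remaining string.
            node.children[key] = RadixNode()
            node, key = node.children[key], ""
    node.is_key = True


def contains(root, key):
    """Look-up: follow edges whose labels are prefixes of the remaining key."""
    node = root
    while key:
        for label, child in node.children.items():
            if key.startswith(label):
                node, key = child, key[len(label):]
                break
        else:
            return False
    return node.is_key


def iter_keys(node, prefix=""):
    """Yield stored keys in lexicographic order (edges visited sorted)."""
    if node.is_key:
        yield prefix
    for label in sorted(node.children):
        yield from iter_keys(node.children[label], prefix + label)


root = RadixNode()
for value in ["Nevada", "California", "A White Rose", "A World Link"]:
    insert(root, value)

print(list(iter_keys(root)))
print(contains(root, "A World Link"), contains(root, "A W"))
```

In this sketch the shared prefix "A W" is stored once as an edge label, and the two suffixes hang below it, which mirrors the split-and-add-suffix insertion described in the list above.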
Finally, the reasons for using a radix tree are that it provides fast look-up, efficient insertions and updates, and support for range scans and prefix look-ups because the data is sorted.
IV. BUILDING INDEX FIELD USING RADIX TREE (RT)
This section discusses how radix trees are used to build convenient indexes for storing data and answering queries. Queries can be processed efficiently with specific indexes. Internal nodes are used as an index to insert and locate data in the radix tree in minimum time. There are two types of nodes:
• The Root Node (RN), such as (FN1, FN2, …, FNk), where each field FNi belongs to the set of field headers in the original data. Each RN is used as the root of a Sub-Radix Tree (SRT). Each internal node stores one element only.
• The Data Node (DN), which holds either a value or a prefix value pv, i.e., the common prefix of two values v that share a prefix. This saves the space of two inner nodes by truncating the path to the leaf. A DN can be described as a tuple (V, {offset}), where the value V has two types:
• v is a complete value that exists in the original data file.
• sv is a suffix value for the prefix value of the parent node.
An entry of a Data Node consists of a pointer to the data and an offset, or set of offsets, that gives the item's location in the original file. Fig. 2 shows the structure of both the Root Node and the Data Node in the radix tree. In this paper, the radix tree is built over a qualifying set of fields. A candidate index for the data set is chosen according to certain criteria, the relationships between the fields themselves, and the optimal number of fields. The fields must be pre-processed before index construction; one possible pre-processing step is organizing the values of each record in the form of tuples.
Fig 2. Single node of tree
Definition 1: An RN is defined as RN = (FN, {p1, p2, …, pi}), where FN indicates a field name in the original data and each pi points to a node of the other type in the tree.
Definition 2: A DN is defined in one of three forms:
• DN = (v, {o1, o2, …, oi}, {p1, p2, …, pi}), where v is a complete value from the domain of the values to be indexed, each o is an offset (position) of this value in the original data set, and each p points to a node that is either a DN in the same SRT or an RN in a new SRT.
• DN = (pv, {p1, p2, …, pi}), where pv is a common prefix of two values v that share a prefix, and each p points to a node that contains a suffix value.
• DN = (sv, {o1, o2, …, oi}, {p1, p2, …, pi}), where sv is the suffix value for the prefix value of the parent node, and each p points to a node that is either a DN in the same SRT or an RN in a new SRT.
  • 4. The values of each field are grouped into a segment that may contain one or more Sub-Radix Trees (SRTs). The order of segments is initially decided according to the relationships between the fields. The segments, numbered {1, 2, …, n} from the highest to the lowest, are illustrated in Fig. 3. Each segment contains one or more block-sized sub-radix trees (SRTi), as shown in Fig. 4. The Root Node (RN) of one SRT stores a key such as FN1, and each of its Data Nodes (DN) refers to a value that belongs to FN1.
Fig 3. Segments in Radix Tree
A value such as (v1FN1, …) is inserted in one of two ways. The first is to add a new node that stores the complete value, as shown in Fig. 3. The second is to find the longest common prefix, such as (pv2FN1), and then add another node to hold the remaining suffix value. Storing a common prefix value only once saves storage space. In this case, the different suffixes are stored in separate nodes (sv2.1FN1, sv2.2FN1), and so on for each value. The Radix Tree (RT) index stores values (v1FN1, sv2.2FN1) together with the position (offset) of each value, which refers to the location of the actual spatial data it represents. Multiple values can reference the same offset.
The search compares each tuple generated by a user's query, which arrives as a query tuple (field, value, operator), where the operator can be equal, greater than, and so on, with each tuple of the original data set, which has the form Data Set_Row (field, value, data type).
The search process descends from the root at the highest segment and proceeds through nodes within the same segment or moves to the next segment. The result of the search in one segment is either a pointer to data (if the search key matches the data key), a pointer to another segment, or null. Thus a query may need to pass through more than one segment to find its answer. More than one SRT under a visited node may need to be searched; hence it might not be possible to guarantee good performance. To avoid a sequential search over the whole tree, a dynamic index can provide a good solution. The dynamic index is based on the radix tree, which helps to retrieve data quickly according to its location and requires visiting only a small number of nodes. A single segment in which sub-radix trees (SRTs) are grouped together is shown in Fig. 4. Each SRT has a root node that stores the field name.
Fig 4. Sub Trees (ST) in one segment
Definition 3: An SRT is defined as SRT = (N, E, n0), an acyclic graph where:
• N is a set of nodes {n0, …, nk}, where k > 0 and n0 ∈ N.
• E is the set of links {e0, …, em}, where m ≥ 0 and each link eij is a pair (ni, nj) such that ni ∈ N, nj ∈ N, and ni ≠ nj for every pair (ni, nj) ∈ E.
• n0 is the only RN in a single SRT: every ni ∈ N \ {n0} must be of type DN.
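Definitions 1 to 3 can be mirrored in a few lines of code. The sketch below is one possible reading, not the paper's implementation: the class names `RootNode` and `DataNode`, the helper `srt_data_nodes` and the offsets 3 and 17 are hypothetical. A prefix node (pv) is modeled as a `DataNode` with an empty offset list.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DataNode:
    """DN = (value, {offsets}, {pointers}); value is a v, pv or sv."""
    value: str
    offsets: List[int] = field(default_factory=list)  # empty for a pv node
    children: List[Union["DataNode", "RootNode"]] = field(default_factory=list)


@dataclass
class RootNode:
    """RN = (FN, {pointers}); the pointers lead to Data Nodes."""
    field_name: str
    children: List[DataNode] = field(default_factory=list)


def srt_data_nodes(root: RootNode) -> List[DataNode]:
    """Collect the DNs of the SRT rooted at `root`.

    A nested RootNode starts a new SRT in the next segment, so it is
    not entered; this keeps `root` the only RN of the SRT (Definition 3).
    """
    found, stack = [], list(root.children)
    while stack:
        node = stack.pop()
        if isinstance(node, DataNode):
            found.append(node)
            stack.extend(node.children)
    return found


# Segment 1 (field ST) with one value that leads to a segment-2 SRT (City).
city = RootNode("City", [DataNode("Los Angeles", offsets=[17])])
st = RootNode("ST", [
    DataNode("Nevada", offsets=[3]),
    DataNode("California", offsets=[17], children=[city]),
])

print(sorted(dn.value for dn in srt_data_nodes(st)))
print(sorted(dn.value for dn in srt_data_nodes(city)))
```

Stopping the traversal at a nested `RootNode` is a design choice of this sketch: it keeps each SRT a separate unit, in line with the idea that a segment can be stored as its own block.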
  • 5. V. DYNAMIC ORDER MULTI-FIELD INDEX (DOMI)
To handle a data set efficiently and to optimize the indexing of a data set repository, a dynamic index structure should be used. This is achieved by building the dynamic index on a radix tree as the basic structure, which helps to retrieve the data quickly. This paper introduces the Dynamic Order Multi-field Index (DOMI) as a new index structure to support efficient query processing. With DOMI, both the search time and the storage overhead are reduced. A dynamic index structure can be constructed using two different data structures: the SRT and the Hash Table (HT). The hash table allows each SRT to be accessed randomly and independently, and provides a way to locate data in constant time. Each root node of an SRTi is saved in the HT. The HT consists of several entries of the form (key, value): the key is a field name and the value is a list of pointers, each of which points to the position of a root node in a radix tree. The proposed DOMI can find the Data Set_Row tuples in the original data source that match the query tuple, using direct access to any field. Every search starts by consulting the hash table and locating the pointers of the required RNs using the field names.
Fig 5. Hash Table (HT)
The design of the dynamic re-ordering multi-field index (DOMI) for querying data sets is illustrated in Fig. 5. Each key in the HT has one or more pointers to the root node of an SRT. If one segment contains more than one SRT, as segment 2 does in Fig. 4, the key has a list of pointers: FN2 → {P1, P2, …, Pn}. These pointers refer directly to the root nodes of any segment that includes the desired values to answer the queries.
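The role of the hash table can be sketched as a dictionary from field names to lists of SRT roots. The sketch makes several assumptions that are not in the paper: the class names `SubRadixTree` and `Domi` and the methods `new_srt`, `search` and `query` are hypothetical, each SRT is replaced by a plain dict from value to offsets (a stand-in, not a real radix tree), and the offsets are invented.

```python
from collections import defaultdict


class SubRadixTree:
    """Stand-in for one SRT: a field name plus a value -> offsets mapping."""

    def __init__(self, field_name):
        self.field_name = field_name
        self.values = defaultdict(set)

    def add(self, value, offset):
        self.values[value].add(offset)


class Domi:
    def __init__(self):
        # HT: field name -> list of pointers to the roots of its SRTs.
        self.ht = defaultdict(list)

    def new_srt(self, field_name):
        """Create an SRT root and register its pointer in the HT."""
        srt = SubRadixTree(field_name)
        self.ht[field_name].append(srt)
        return srt

    def search(self, field_name, value):
        """Go straight to the SRTs of one field; None if the field is unknown."""
        if field_name not in self.ht:
            return None
        offsets = set()
        for srt in self.ht[field_name]:
            offsets |= srt.values.get(value, set())
        return offsets

    def query(self, conditions):
        """AND of (field, value) conditions: intersect the offset sets."""
        result = None
        for field_name, value in conditions.items():
            offsets = self.search(field_name, value)
            if offsets is None:
                return None
            result = offsets if result is None else result & offsets
        return result


domi = Domi()
city_ca = domi.new_srt("City")   # SRT under the value California
city_oh = domi.new_srt("City")   # second SRT in the same segment (Ohio)
company = domi.new_srt("Company Name")
city_ca.add("Los Angeles", 1)
city_ca.add("Garden Grove", 2)
city_oh.add("Findlay", 3)
company.add("A White Rose", 2)

print(domi.search("City", "Findlay"))
print(domi.query({"City": "Garden Grove", "Company Name": "A White Rose"}))
```

The key "City" holds two pointers because its segment contains two SRTs, which corresponds to the list FN2 → {P1, P2, …, Pn} described above; a query on "City" never touches the SRT of the upper segment.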
Definition 4: An HT is defined as HT = ({FN1, FN2, …, FNn}, {P1, P2, …, Pn}), where (1) FN is the set of RNs, one for each segment, and (2) P is the set of pointers for each RN. When an input query q is given as a tuple T = (FN, value) and the FN of the query matches an FN in the HT, the HT returns the answer to q according to {FNk, Pn ⊆ SRTi, SRT ∈ RT}.
The search process starts by comparing the field in the query with the field names in the hash table. If they match, the search follows the link that connects the HT to a particular SRTi, where the desired data is found. If there is no match, the value does not exist and the search terminates. When a query tuple is received, it goes immediately to the HT to determine the suitable root at the right segment, and the query is then processed accordingly.
A. Preliminaries
A binary search tree of height h can support any of the basic dynamic-set operations, such as SEARCH, PREDECESSOR, SUCCESSOR, MINIMUM, MAXIMUM, INSERT, and DELETE, in O(h) time. These operations are fast if the height of the search tree is small. If the height is large, they may be no faster than with a linked list [23]. Hash tables, on the other hand, support the dictionary operations INSERT, DELETE, and SEARCH. In the worst case, hashing requires O(n) time to perform a SEARCH operation, but the expected time for hash-table operations is O(1) [23]. For a radix tree, the worst-case complexity of look-up, insert, and delete operations is O(l), where l is the maximum length of a string in the set [24]. Table 1 lists the time complexity of the worst operations (e.g., insert and look-up) for different data structures, where n is the number of elements and l is the maximum length of the new key.
Table 1. Comparison between Index Structures
Index Structure | Time complexity
B-tree | O(log n)
B+-tree | O(log n)
R-tree | Does not utilize space efficiently; no worst-case time complexity bound [19]
Radix Tree | O(l)
Hash Table | O(1) expected
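The difference between the O(log n) and O(l) rows of Table 1 can be made concrete by counting steps instead of measuring time. The sketch below is only an illustration under stated assumptions: a binary search over a sorted list stands in for a balanced search tree, an uncompressed character trie stands in for a radix tree, and the function names and the six-character keys are hypothetical.

```python
def binary_search_steps(sorted_keys, key):
    """Return (found, number of key comparisons) for a binary search."""
    lo, hi, steps = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sorted_keys) and sorted_keys[lo] == key, steps


def build_trie(keys):
    """Uncompressed trie as nested dicts; '$end' marks a stored key."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node["$end"] = True
    return root


def trie_steps(root, key):
    """Return (found, number of edges followed); at most len(key) edges."""
    node, steps = root, 0
    for ch in key:
        if ch not in node:
            return False, steps
        node = node[ch]
        steps += 1
    return "$end" in node, steps


for n in (100, 10_000):
    keys = [f"{i:06d}" for i in range(n)]  # fixed key length l = 6
    trie = build_trie(keys)
    probe = keys[-1]
    print(n, binary_search_steps(keys, probe)[1], trie_steps(trie, probe)[1])
```

With fixed-length keys, the number of trie steps stays at the key length while the number of binary-search comparisons grows with the number of keys.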
  • 6. Most index structures have time complexity in terms of O(log n), but they differ in the factors, terms and conditions that apply when they are used to develop algorithms [19]. Radix trees, on the other hand, have a number of interesting properties that distinguish them from other search trees [9]:
• The height (and complexity) of a radix tree depends on the length of the keys, not on the number of elements in the tree.
• Radix trees do not need rebalancing operations, and all insertion orders result in the same tree.
• The keys are stored in lexicographic order.
• The path to a Data Node represents the key of that leaf. Therefore, keys are stored implicitly and can be reconstructed from paths.
B. Insertion Algorithm
Fig. 6 presents the pseudocode of a simple algorithm for inserting the values of Data Set_Row tuples, in (key/value) form, into the segments of the radix tree.
Fig 6. Simple Algorithm to insert (RN & DN)
Algorithm (Parameters: V: pair (v, {offset1, …, offseti}); SRTi: sub-radix tree; pv: prefix value; vi, vj: different values; HT: hash table; pi: new pointer)
1) IF N is a Root Node:
2) For each value V of the tuple Data Set_Row, expand a new Leaf Node into SRTi and add V, or
3) IF a Data Node contains pvj = pvi, split the node and insert sv for each v.
4) IF N equals null:
5) Add a new Root Node inside the segment, then call steps (2 or 3).
6) Add the new pi of the root node into HT.
C. Insertion Example
The insertion of values is explained for two cases. Fig. 7 shows segment 1 beginning with the key 'ST', followed by the value 'Nevada' at the right side of the root ST. A new entry 'ST=California' is inserted into the upper segment. The search for the new value reaches a null pointer at a non-leaf node. A node is then created for the value "California" and inserted as a Data Node (child) of the root node "ST" (line 2 of the algorithm). The DN of this value can be inserted directly into the existing root node, which now has a new child. Whenever a new entry of another tuple is inserted into SRT1 in this way, no new pointer for the RN 'ST' is added to the hash table. If an existing DN in segment 1 contains a prefix value pv, it is compared with the new value of the entry; the node can be split, giving an sv for both values (line 3 of the algorithm). The symbols (…..) in a node represent the offset numbers that are inserted together with the values.
Fig 7. Inserted new Data Node (case 1)
Note that a middle segment can contain more than one SRT. A new value is inserted into a middle segment by following the path of the SRTs in the segments above it. Segment 2 contains more than one RN 'City'. To insert the new entry 'City=Los Angeles', a new DN is appended to the existing RN2 of segment 2: a node is created for the value "Los Angeles" and added as a child DN of the RN "City". Since no new RN is added to the index, nothing needs to be added to the HT. A new entry "Company Name=A World Link" is inserted by descending until a leaf node is reached. The stored value is not the same as the new value, so a new path has to be generated for the common prefix. The old value "A white Rose" and the new value "A World Link" have the same prefix "A W", which branches out to two different leaves, each of which contains the suffix of one value. Note that some segments of the DOMI did not require the creation of RNs, because the insertion is performed on existing, previously inserted RNs. In this case, there is no need to add new pointers for RNs to the hash table, which remains unchanged.
  • 7. Therefore, the insertion operation can be performed in O(|V|), where |V| is the length of the value to be inserted.
The insertion operation in case 2 is illustrated in Fig. 8. A new DN is created to insert the value 'Ohio' into segment 1.
Fig 8. Inserted new Root Node (case 2)
In segment 2, an RN 'City' must be inserted first, and then the value 'Findlay' can be inserted as a child node. Since no RN exists under the DN 'Ohio', a new RN 'City' is created (see line 5) as an extension of the value directly above it, and then a new DN is created to insert the value 'Findlay'. The same steps are applied in segment 3. Segment 2 and segment 3 of the DOMI thus require the creation of RNs, which means new pointers for the RNs ('City', 'Company Name') must be added to the hash table (see line 6).
The time complexity of insertion in DOMI depends on the time complexity of the Multi-Segments Radix Tree Index (MSRTI) and the Hash Table (HT). A new value has to be inserted into the appropriate DN, and the RN for this value has to be inserted in the HT. The time complexity is therefore determined as follows:
l = max(|V1|, |V2|, …, |Vn|), i.e., l is the length of the longest value.
MSRTI: O(n·l), where n is the number of segments, which equals the number of fields.
DOMI: O(n·l) + O(1) = O(n·l), since O(1) is negligible.
D. Algorithm Search
The algorithm for searching values in the DOMI data structure is presented in Fig. 9.
Fig 9. Simple Algorithm for search in DOMI
Usually, a tree search must descend from the highest tree, so more than one SRT might be traversed.
E. Example search
Fig. 10 describes how to search DOMI for a specific value stated in a given query. Given a query, the search begins from the RNs that are stored in the HT. It checks the RNs until a field name that matches the query field name is found.
If there is a match, the search follows the direct link that refers to a particular block-sized SRTi. The search then continues from the RN of SRTi down to a DN. The comparison between a key of Tquery and an RN of the HT is shown in line 1 of the algorithm of Fig. 9:
Input: Q = (k1, v1, k2, v2): tuples of the query.
Output: all occurrences of Q in the Data Set.
RN: root node; p: pointer of RN.
/* Search begins at the hash table (HT) */
1) Check a ki of Tquery against an RNi of the HT; if they match:
2) Follow the pi of RNi in the HT toward SRTi.
3) Return the output.
4) Otherwise, if ki of Tquery ≠ RNi of the HT, return null.
The following example illustrates how queries are processed using the DOMI structure. A query Q is stated as "City = Garden Grove and Company name = A white Rose". The query goes immediately to the HT. If 'City' and 'Company name' of Tquery match the corresponding FNs in the HT, the search follows the pointers of those FNs. P1 in the pointers of 'City' points directly to segment 2, which includes the value "Garden Grove", without traversing segment 1. P1 in the pointers of 'Company name' points directly to segment 3 without traversing segment 1 and segment 2 sequentially.
  • 8. Segment 3 includes the node that stores the value "A white Rose". The search continues from that node down to a DN to reach the desired data. The leftmost DN of segment 3 represents the common prefix of the two values "A white Rose" and "A world Link". The leaf DN contains the pair of the value and its position (offset), which refers to the location of the actual spatial data within the input stream. If an FN does not match the search key, the key does not exist and the search terminates.
Fig 10. Search in DOMI
VI. CONCLUSIONS
Big data environments need more efficient index structures to speed up the evaluation of queries. This paper has introduced a new index structure, the Dynamic Ordered Multi-Field Index (DOMI). DOMI is based on a collection of radix trees together with a single hash table. The hash table allows random access to any sub-radix tree without traversing the trees in the upper segments. In addition, the use of radix trees decreases space consumption by storing common prefix values only once, and provides efficient time complexity for insertion and search operations. For these reasons, we believe that the proposed DOMI offers an attractive alternative to other structures for indexing ever-growing big data.
REFERENCES
[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Google, Inc., 2004.
[2] Brian F. Cooper, Neal Sample, Michael J. Franklin, Gísli R. Hjaltason, Moshe Shadmon, "A Fast Index for Semistructured Data", Proceedings of the 27th VLDB Conference, Roma, Italy, 2001.
[3] Christophe Cérin, Michel Koskas, Jean-Sébastien Gay, Gaël Le Mahec, "Efficient Data-Structures and Parallel Algorithms for Association Rules Discovery", Proceedings of the Fifth Mexican International Conference, IEEE, 2004.
[4] Anand Rajaraman, Jure Leskovec, Jeffrey D. Ullman, "Mining of Massive Datasets", 2012.
[5] Andrew S. Tanenbaum, Maarten Van Steen, "Distributed Systems: Principles and Paradigms", 2007.
[6] V. Gaede and O. Günther, "Multidimensional Access Methods", ACM Computing Surveys, vol. 30, no. 2, pp. 170-231, June 1998.
[7] Kevin McGowan, "Big data, Fast Processing Speeds", SAS Solutions on Demand, Cary NC, 2013.
[8] Xiangwu Ding, Wenbing Yu, Jiajin Le, "An Adaptive Projection Strategy and Its Implementation in Column Stores", IEEE, 2011.
[9] Viktor Leis, Alfons Kemper, Thomas Neumann, "The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases", ICDE, 2013.
[10] Goetz Graefe, "Efficient columnar storage in B-trees", ACM, 2007.
[11] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani Thuraisingham, "A Multi-partition Multi-chunk Ensemble Technique to Classify Concept-Drifting Data Streams", Springer-Verlag Berlin Heidelberg, 2009.
[12] K. Ramamohanarao, John W. Lloyd, "Dynamic Hashing Schemes", ACM Computing Surveys, 1998.
[13] Per-Ake Larson, "Linear hashing with separators—a dynamic hashing scheme achieving one-access", ACM Transactions on Database Systems, 1988.
[14] Hannes Voigt, Tobias Jaekel, Thomas Kissinger, Wolfgang Lehner, "Adaptive Index Buffer", 28th International Conference on Data Engineering Workshops, IEEE, 2012.
[15] P. O'Neil, D. Quass, "Improved Query Performance with Variant Indexes", ACM SIGMOD International Conference on Management of Data, pp. 38-49, 1997.
  • 9. [16] J. Corbet, "Trees I: Radix trees", http://lwn.net/Articles/175432.
[17] B. Schlegel, R. Gemulla, W. Lehner, "k-ary search on modern processors", DaMoN workshop, 2009.
[18] R. Bayer and E. McCreight, "Organization and maintenance of large ordered indices", SIGFIDET, 1970.
[19] P. Patel, D. Garg, "Comparison of Advance Tree Data Structures", IJCA International Journal of Computer, 2012.
[20] Guojun Lu, "Techniques and Data Structures for Efficient Multimedia Retrieval Based on Similarity", IEEE, 2002.
[21] Lisa A. Horwitz, "Techniques for Managing Large Data Sets: Compression, Indexing and Summarization", Applications, 2012.
[22] Ajit Singh, Dr. Deepak Garg, "Implementation and Performance Analysis of Exponential Tree Sorting", International Journal of Computer Applications, pp. 34-38, June 2011.
[23] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein, "Introduction to Algorithms, Third Edition", 2009.