SlideShare une entreprise Scribd logo
1  sur  10
M-tree:     An Efficient Access Method
                        for Similarity   Search in Metric Spaces

                                                                Marco Patella
               Pa010 Ciaccia                                                                                Pave1 Zezula
          DEIS - CSITE-CNR                                 DEIS - CSITE-CNR                                 CNUCE-CNR
                                                              Bologna,            Italy                      Piss, Italy
            Bologna, Italy
     pciaccia@deis.unibo.it                            mpatella@deis               . unibo . it       zezula@iei.pi.cnr.it




                                                                                increased and resulted in the development of multi-
                                                                                media database systems aiming at a uniform manage-
                           Abstract                                             ment of voice, video, image, text, and numerical data.
                                                                                Among the many research challenges which the multi-
    A new access method, called M-tree, is pro-                                 media technology entails - including data placement,
    posed to organize and search large data sets                                presentation, synchronization, etc. - content-based re-
                                                                                trieval plays a dominant role. In order to satisfy the
    from a generic “metric space”, i.e. where ob-
                                                                                information needs of users, it is of vital importance to
    ject proximity is only defined by a distance
                                                                                effectively and efficiently support the retrieval process
    function satisfying the positivity, symmetry,
                                                                                devised to determine which portions of the database
     and triangle inequality postulates. We detail
                                                                                are relevant to users’ requests.
     algorithms for insertion of objects and split
    management, which keep the M-tree always                                         In particular, there is an urgent need of index-
     balanced - several heuristic split alternatives                            ing techniques able to support execution of similarity
     are considered and experimentally evaluated.                                queries. Since multimedia applications typically re-
     Algorithms for similarity (range and k-nearest                             quire complex distance functions to quantify similari-
                                                  Re-
     neighbors) queries are also described.                                     ties of multi-dimensional   features, such as shape, tex-
     sults from extensive experimentation with a                                ture, color [FEF+94], image patterns [VM95], sound
     prototype system are reported, considering as                               [WBKW96], text, fuzzy values, set values [HNP95], se-
     the performance criteria the number of page                                quence data [AFS93, FRM94], etc., multi-dimensional
     I/O’s and the number of distance computa-                                   (spatial) access methods (SAMs), such as R-tree
     tions. The results demonstrate that the M-                                  [Gut841 and its variants [SRF87, BKSSSO], have been
     tree indeed extends the domain of applica-                                 considered to index such data. However, the applica-
     bility beyond the traditional vector spaces,                               bility of SAMs is limited by the following assumptions
     performs reasonably well in high-dimensional                               which such structures rely on:
     data spaces, and scales well in case of growing
                                                                                  1. objects are, for indexing purposes, to be repre-
     files.
                                                                                     sented by means of feature values in a multi-
                                                                                     dimensional vector space;
1     Introduction
                                                                                 2. the (dis)similarity of any two objects has to be
Recently, the need to manage various types of data
                                                                                    based on a distance function which does not in-
stored in large computer repositories has drastically
                                                                                    troduce any correlation (or “cross-talk”) between
                                                                                    feature values [FEF+94]. More precisely, an L,
              to copy without fee all or part of this material
Permission                                                           is
                                                                                    metric, such as the Euclidean distance, has to be
granted provided that the copies are not made or distributed for
direct commercial advantage,                   copyright notice and
                                  the VLDB
                                                                                    used.
the title of the publication  and its date appear,      and notice is
given that copying is by permission of the Very Large Data Base
                                                                                Furthermore, from a performance point of view, SAMs
                To copy otherwise,                      requires a fee
Endowment.                           or to republish,
                                                                                assume that comparison of keys (feature values) is a
and/or special permission from the Endowment.
                                                                                trivial operation with respect to the cost of accessing a
Proceedings  of the 23rd VLDB                Conference
                                                                                disk page, which is not always the case in multimedia
Athens, Greece, 1997



                                                                          426
applications. Consequently, no attempt in the design               cific metric distance function d.
 of these structures has been done to reduce the number                  Formally, a metric space is a pair, M = (D,d),
 of distance computations.                                           where V is a domain of feature values - the indexing
                                                                     keys - and d is a total (distance) function with the
      A more general approach to the “similarity index-
 ing” problem has gained some popularity in recent                   following properties:l
 years, leading to the development of so-called metric
                                                                                                                           (symmetry)
                                                                       1. d(O,, 0,) = d(C),, 0,)
  trees (see [UhlSl]). M et ric trees only consider relative
 distances of objects (rather than their absolute posi-                2. d(O,, 0,) > 0 (0, # 0,) and d(O,, 0,)            =0
 tions in a multi-dimensional       space) to organize and                                                          (non negativity)
 partition the search space, and just require that the
                                                                       3. d(O,, 0,) 5 4%          0,) + d(Oz,O,)
 function used to measure the distance (dissimilarity)
                                                                                                               (triangle    inequality)
 between objects is a metric (see Section 2), so that the
 triangle inequality property applies and can be used to             In principle, there are two basic types of similarity
 prune the search space.                                             queries: the range query and the k nearest neighbors
      Although the effectiveness of metric trees has been            query.
 clearly demonstrated [Chi94, Bri95, B097], current de-
                                                                     Definition   2.1 (Range)
 signs suffer from being intrinsically static, which limits
                                                                      Given a query object Q E 2) and a maximum search
 their applicability in dynamic database environments.
                                                                     distance r(Q), the range query range(Q, r(Q)) selects
 Contrary to SAMs, known metric trees have only tried
                                                                     all indexed objects Oj such that d(Oj, Q) 5 r(Q).  q
 to reduce the number of distance computations re-
quired to answer a query, paying no attention to I/O                 Definition  2.2 (k nearest neighbors     (LNN))
costs.                                                                Given a query object Q E 2) and an integer k 2 1,
      In this article, we introduce a paged metric tree,             the k-NN query NN(&, k) selects the k indexed objects
called M-tree, which has been explicitly designed to                                                                     0
                                                                     which have the shortest distance from Q.
be integrated with other access methods in database
systems. We demonstrate such possibility by imple-                   There have already been some attempts to tackle the
menting M-tree in the GiST (Generalized Search Tree)                 difficult metric space indexing problem. The FastMap
 [HNP95] framework, which allows specific access meth-                algorithm [FL951 transforms a matrix of pairwise dis-
ods to be added to an extensible database system.                    tances into a set of low-dimensional points, which can
                                                                     then be indexed by a SAM. However, FastMap as-
      The M-tree is a balanced tree, able to deal with
                                                                     sumes a static data set and introduces approximation
dynamic data files, and as such it does not require
                                                                     errors in the mapping process. The Vantage Point
periodical reorganizations.       M-tree can index objects
                                                                      (VP) tree [Chi94] partitions a data set according to
using features compared by distance functions which
                                                                     distances the objects have with respect to a reference
either do not fit into a vector space or do not use an L,
                                                                      (vantage) point. The median value of such distances
metric, thus considerably extends the cases for which
                                                                     is used as a separator to partition objects into two
efficient query processing is possible. Since the design
                                                                     balanced subsets, to which the same procedure can re-
of M-tree is inspired by both principles of metric trees
                                                                     cursively be applied. The MVP-tree [B097] extends
and database access methods, performance optimiza-
                                                                     this idea by using multiple vantage points, and exploits
tion concerns both CPU (distance computations) and
                                                                     pre-computed distances to reduce the number of dis-
I/O costs.
                                                                     tance computations at query time. The GNAT de-
     After providing some preliminary background in
                                                                     sign [Bri95] applies a different - so-called generalized
Section 2, Section 3 introduces the basic M-tree prin-
                                                                                   [Uhlgl] - partitioning style. In the basic
                                                                     hyperplane
ciples and algorithms.       In Section 4, we discuss the
                                                                     case, two reference objects are chosen and each of the
available alternatives for implementing the split strat-
                                                                     remaining objects is assigned to the closest reference
egy used to manage node overflows. Section 5 presents
                                                                     object. So obtained subsets can recursively be split
experimental results. Section 6 concludes and suggests
                                                                     again, when necessary.
topics for future research activity.
                                                                         Since all of the above organizations build trees by
                                                                     means of a top-down recursive process, these trees are
2    Preliminaries
                                                                     not guaranteed to remain balanced in case of inser-
                                                                     tions and deletions, and require costly reorganizations
Indexing a metric space means to provide an efficient
support for answering similarity   queries, i.e. queries             to prevent performance degradation.
whose purpose is to retrieve DB objects which are
                                                                          ‘In order to simplify the presentation, we sometimes refer to
“similar” to a reference (query) object, and where the               0, as an object, rather than as the feature value of the object
(dis)similarity between objects is measured by a spe-                itself.




                                                               427
3     The M-tree                                                       file.3 In summary, entries in leaf nodes are structured
                                                                       as follows.
The research challenge which has led to the design of
                                                                                                 (feature value of the) DB object
M-tree was to combine advantages of balanced and                             Oj
                                                                                                 object identifier
                                                                             oid(Oj)
dynamic SAMs, with the capabilities of static metric
                                                                                                 distance of Oj from its parent
                                                                             d(Oj, P(Oj))
trees to index objects using features and distance func-
tions which do not fit into a vector space and are only
                                                                       3.2        Processing      Similarity   Queries
constrained by the metric postulates.
    The M-tree partitions objects on the basis of their                Before presenting specific algorithms for building the
relative distances, as measured by a specific distance                 M-tree, we show how the information stored in nodes
function d, and stores these objects into fixed-size                   is used for processing similarity queries. Although per-
nodes,2 which correspond to constrained regions of the                 formance of search algorithms is largely influenced by
metric space. The M-tree is fully parametric on the                    the actual construction of the M-tree , the correctness
distance function d, so the function implementation is                 and the logic of search are independent of such aspects.
a black-box for the M-tree. The theoretical and appli-                     In both the algorithms we present, the objective is
cation background of M-tree is thoroughly described                    to reduce, besides the number of accessed nodes, also
in [ZCRSG]. In this article, we mainly concentrate on                  the number of distance computations needed to exe-
implementation issues in order to establish some basis                 cute queries. This is particularly relevant when the
for performance evaluation and comparison,                             search turns out to be CPU- rather than I/O-bound,
                                                                       which might be the case for computationally intensive
                                                                       distance functions. For this purpose, all the informa-
3.1    The   Structure     of M-tree    Nodes
                                                                       tion concerning (pre-computed) distances stored in the
Leaf nodes of any M-tree store all indexed (database)                  M-tree nodes, i.e. d(O;, P(Oi)) and r(Oi), is used to
objects, represented by their keys or features, whereas                effectively apply the triangle inequality.
internal nodes store the so-called routing objects. A
routing object is a database object to which a routing                 3.2.1        Range      Queries
role is assigned by a specific promotion algorithm (see
                                                                       The query range(&, r(Q)) selects all the DB objects
Section 4).
                                                                       such that d(Oj,Q)   < r(Q). Algorithm RS starts from
    For each routing object 0, there is an associated                  the root node and recursively traverses all the paths
pointer, denoted ptr(T(O,)),    which references the root              which cannot be excluded from leading to objects sat-
of a sub-tree, T(O,), called the cowering tree of 0,.                  isfying the above inequality.
All objects in the covering tree of 0, are within the
                                                                                   Q:query-object, r(Q):search~adius)
                                                                       RS(N:node,
distance r(0,) from O,., r(O,.) > 0, which is called the
                                                                       { let 0, be the parent object of node N;
covering radius of 0, and forms a part of the 0, entry
                                                                         if N is not a leaf
in a particular M-tree node. Finally, a routing object                   then { V 0, in N do:
0, is associated with a distance to P(O,), its parent                                if Id(O,, Q) - d(O,, 0,) ( 5 r(Q) + r(0,)
object, that is the routing object which references the                              then { Compute d(O,, Q);
node where the 0, entry is stored. Obviously, this                                            if 4% 9) I r(Q) + r(Or)
distance is not defined for entries in the root of the                                       then RS(*ptr(T(O,)),Q,r(Q));      }}
                                                                                   { V 0, in N do:
                                                                         else
M-tree. The general information for a routing object
                                                                                     if Id(O,,Q)-d(O,,Or)l<r(Q)
entry is summarized in the following table.
                                                                                     then { Compute d(0, , Q) ;
                                                                                              if d(O,, Q) I r(Q)
                                                                                             then add oid(Oj) to the result ; }}}
                    (feature value of the) routing object
                                                                           Since, when accessing node N, the distance between
~~                                                                     Q and O,, the parent object of N, has already been
                                                                       computed, it is possible to prune a sub-tree without
                                                                                    any new distance at all. The condition ap-
                                                                       computing
An entry for a database object Oj in a leaf is quite sim-
                                                                       plied for pruning is as follows.
ilar to that of a routing object, but no covering radius
is needed, and the pointer field stores the actual ob-                           3.1 If d(Op,Q)
                                                                       Lemma                        > r(Q) + r(O,.), then, for
ject identifier (oid), which is used to provide access to              each object Oj in T(O,),   it is d(Oj,Q) > r(Q). Thus,
the whole object possibly resident on a separate data                         can be safely pruned from the search.
                                                                       T(0,)
                                                                          30f course, the M-tree can also be used as a primary data
   2Nothing would prevent using variable-size nodes, as it is
                                                                       organization, where the whole objects are stored in the leaves of
done in the X-tree [BKK96]. For simplicity, however, we do not
                                                                       the tree.
consider this possibility here.



                                                                 428
and its current k-th nearest neighbor - the order in
                                                                    which nodes are visited can affect performance. The
                                                                    heuristic criterion implemented by the ChooseNode
                                                                    function is to select the node for which the d,,,i,, lower
                                                                    bound is minimum.       According to our experimental
                                                                    observations, other criteria do not lead to better per-
                                                                    formance.
                                       b)
              a)
                                                                    ChooseNode(PR:priority-queue) : node
                                                                    { let Lin(T(O~))      = min{Lin(T(Or))}.
Figure 1: Lemma 3.2 applied to avoid distance com-
                                                                           considering all the entries       in PR;
putations
                                                                      Remove entry Cptr(T(0:)) ,d,i,(T(Oz))l      from PR;
In fact, since d(Oj, Q) 2 d(O,, Q)-d(Oj, 0,) (triangle                return *ptr(T(O:));      }
inequality) and d(Oj,Or)     < r(0,) (def. of covering
                                                                       At the end of execution, the i-th entry of the NN
radius), it is d(Oj,Q) 2 d(O,,&) - r(0,).    Since, by
                                                                    array will have value NNCil = Loid           ,d(Oj, Q)l,
hypothesis, it is d(Or,Q) - r(0,) > r(Q), the result
                                                                    with Oj being the i-th nearest neighbor of Q. The
follows.
                                                                    distance value in the i-th entry is denoted as d;, so
    In order to apply Lemma 3.1, the d(O,, Q) distance
                                                                    that dk is the largest distance value in NN. Clearly, dk
has to be computed. This can be avoided by taking
                                                                    plays the role of a dynamic search radius, since any
advantage of the following result.
                                                                    sub-tree for which dmin(T(Or))      > dk can be safely
         3.2 IfId(O,,Q)-d(Or,O,)l>r(Q)+r(O~),
Lemma                                                               pruned.
then d(O,, Q) > r(Q) + r(O,).                                          Entries of the NR array are initially           set to
                                                                    NNCil = C-,001 (i= l,..., k), i.e. oid’s are undefined
This is a direct consequence of the triangle inequality,
                                                                    and di = 00. As the search starts and (internal) nodes
which guarantees that both d(Or, Q) 2 d(O,, Q) -
                                                                    are accessed, the idea is to compute, for each sub-tree
            (Figure 1 a) and d(O,, Q) 2 d(Or,O,) -
d(O,,O,)
                                                                    T(O,), an upper bound, d,,,(T(O,)),      on the distance
d(O,, Q) (Figure 1 b) hold. The same optimization
                                                                    of any object in T(0,) from Q. The upper bound is
principle is applied to leaf nodes as well. Experimen-
                                                                    set to
tal results (see Section 5) show that this technique
                                                                               dmoz(T(Or)) = d(O,, Q) + r(O,)
can save up to 40% distance computations. The only
case where distances are necessary to compute is when               Consider the simplest case k = 1, two sub-trees,
dealing with the root node, for which 0, is undefined.              f(jA$Land     T(O,,), and amme that d,,,(T(O,,))       =
                                                                             min(T(Ora)) = 7. Since d,,,(T(Op,))     guaran-
                                                                    tees that an object whose distance from Q is at most
                                                                    5 exists in T(O,,), T(O,,) can be pruned from the
3.2.2     Nearest     Neighbor   Queries
                                                                    search. The d,,, bounds are inserted in appropriate
The k-NNSearch algorithm retrieves the k nearest                    positions in the NN array, just leaving the oid field un-
neighbors of a query object Q - it is assumed that                  defined. The k-NNSearch       algorithm is given below.
at least k objects are indexed by the M-tree. We
                                                                    k-NNSearch(T:rootnode,Q:query-object,k:integer)
use a branch-and-bound technique, quite similar to
                                                                    { PR = CT,-];
the one designed for R-trees [RKV95], which utilizes
                                                                       for i = 1 to k do: NN[i] = [-,001;
two global structures: a priority queue, PR, and a k-
                                                                       while PR #a do:
elements array, NN, which, at the end of execution,
                                                                       { Nextlode    = ChooseNode(
contains the result.                                                      k-NNHodeSearch(NextHode,Q,          k); }}
    PR is a queue of pointers to active sub-trees, i.e. sub
trees where qualifying objects can be found. With the                   The $-NNNodeSearch method implements most of
pointer to (the root of) sub-tree T(O,), a lower bound,             the search logic. On an internal node, it first deter-
                on the distance of any object in T(0,)              mines active sub-trees and inserts them into the PR
d,i”(T(O,)),
                                                                    queue. Then, if needed, it calls the RN-Update function
from Q is also kept. The lower bound is
                                                                    (not specified here) t o p er f orm an ordered insertion in
                                     Q) - r(G),
        dmin(T(Or))    = max{d(O,,                0)                the RN array and receives back a (possibly new) value
                                                                    of dk. This is then used to remove from PR all sub-trees
since no object in T(0,) can have a distance from Q
                                                                    for which the dmin lower bound exceeds dk. Similar ac-
less than d(O,, Q) - r(0,). These bounds are used by
                                                                    tions are performed in leaf nodes. In both cases, the
the ChooseNode function to extract from PR the next
                                                                    optimization to reduce the number of distance compu-
node to be examined.
                                                                    tations by means of the pre-computed distances from
   Since the pruning criterion of k-NNSearch   is dy-
                                                                    the parent object, is also applied.
namic - the search radius is the distance between Q


                                                              429
The determination of the set Ni,, - routing objects
k-NNHodeSearch(N:node,Q:query-object    ,k: integer)
{ let 0, be the parent object of node N;                            for which no enlargement of the covering radius is
  if N is not a leaf then                                           needed - can be optimized by saving distance com-
  { V 0, in N do:                                                   putations. From Lemma 3.2, by substituting 0, for &
     if Id(O,,Q) -d(O,,O,)) 5 dk +r(O,)   then                      and setting r(0,) z r(Q) = 0, we derive that:
     { Compute d(O,,Q);
         if        dmi,(T(Or))   5 dk then                          If Id(O,, O,)-d(O,,   Op)(> r(0,)   then d(O,, 0,) > (0,)
         { add Cptr(T(O,)),d,i,(T(O,))l            to PR;
                                                                    from which it follows that 0,. # Ni,,. Note that this
               if dmaz(T(Or)) < dk then
                                                                    optimization cannot be applied in the root node.
               { dk = NN-Update(C-,d,,,(T(O,))l)   ;
                    Remove from PR all entries
                                                                    3.4    Split   Management
                      for which dmin(T(Or)) > dk; }}}}
  else        /*    N is a leaf    */                               As any other dynamic balanced tree, M-tree grows in
  { VO, inNdo:                                                      a bottom-up fashion. The overflow of a node N is
    if (d(O,, Q) - d(O,, 0,) 1< dk then                             managed by allocating a new node, N’, at the same
    {Compute d(O,, Q);                                              level of N, partitioning the entries among these two
      if d(O,, Q) 5 dk then                                         nodes, and posting (promoting) to the parent node,
      { dk = NNJJpdate([oid(O,),d(Oj,Q)l)  ;                        Np, two routing objects to reference the two nodes.
        Remove from PR all entries                                  When the root splits, a new root is created and the
           for which dmin(T(Or)) > dk; }}}}                         M-tree grows by one level up.
                                                                    Split(N:node;      E:H-tree-entry)
3.3      Building          the M-tree
                                                                    { let n/ = entries of node N U {E};
Algorithms for building the M-tree specify how objects                ifN is not the root then
are inserted and deleted, and how node overflows and                      let 0, be the parent of N. stored in Np node;
underflows are managed.. Due to space limitations,                                 a nev node N’;
                                                                      Allocate
                                                                      Promote (JV , O,, , Op2) ;
deletion of objects is not described in this article.
                                                                      Partition(hf,O,,     ,Opa ,n/l ,&I ;
    The Insert algorithm recursively descends the M-
                                                                      Store Nl’s entries in N and Nz’s entries in N’;
tree to locate the most suitable leaf node for accommo-
                                                                      if N is the current root
dating a new object, 0,, possibly triggering a split if
                                                                      then { Allocate      a new root node, Np;
the leaf is full. The basic rationale used to determine
                                                                                Store entry(O,,)     and entryCOp,) in Np; }
the “most suitable” leaf node is to descend, at each
                                                                      else { Replace entry             with entry(O,,) in N,,;
level of the tree, along a sub-tree, T(O,), for which                           if node Np is full
no enlargement of the covering radius is needed, i.e.                           then Split (N,, entryto,,))
d(O,,O,)     5 r(0,).     If multiple sub-trees with this                                                  in N,,; }}
                                                                                else store entry(Op,)
property exist, the one for which object 0, is clos-
                                                                        The Promote method chooses, according to some
est to 0, is chosen. This heuristics tries to obtain
                                                                    specific criterion, two routing objects, O,, and O,, , to
well-clustered sub-trees, which haa a beneficial effect
                                                                    be inserted into the parent node, Np. The Partition
on performance.
                                                                    method divides entries of the overflown node (the N
    If no routing object for which d(O,,O,)        5 ~(0,)
                                                                    set) into two disjoint subsets, Nl and Nz, which are
exists, the choice is to minimize the increase of the
                                                                    then stored in nodes N and N’, respectively. A specific
covering radius, d(O,, 0,) - r(0,). This is tightly re-
                                                                    implementation of the Promote and Partition         meth-
lated to the heuristic criterion that suggests to mini-
mize the overall “volume” covered by routing objects                ods defines what we call a split policy . Unlike other
in the current node.                                                (static) metric tree designs, each relying on a specific
Insert (N :node, entry (0,) :H-tree-entry)                          criterion to organize objects, M-tree offers a possibil-
{ let, N’ be the set of entries in node N;                          ity of implementing alternative split policies, in order
   if N is not a leaf then                                          to tune performance depending on specific application
   { let AC;, = entries such that d(O,,O,) 5 r(0,);
                                                                    needs (see Section 4).
      if AL # 0
                                                                        Regardless of the specific split policy, the semantics
      then let entry          E n/,, :d(O:, 0,) is minimum;
                                                                    of covering radii has to be preserved. If the split node
      else { let entry(O,‘) E N:
                                                                    is a leaf, then the covering radius of a promoted object,
                         d(Or , 0,) - ~(0:) is minimum;
                                                                    say Opl, is set to
              let r(O:) = d(O:,O,);        }
      Insert (*ptr (T(O:)) , entry (0,) ) ; }
                                                                              r(Opl) = ma4d(Oj,     O,,)lO~ E A41
   else /* N is a leaf */
   { if N is not full                                               whereas if overflow occurs in an internal node
      then store entry           in N
      else Split(N,entry(O,))        ; }}                                 r(Opl) = max{d(O,,    O,,) + r(Or)IOr   E NI)


                                                              430
which guarantees that d(Oj,O,,)            5 ~(0~~) holds for                    pair of objects for which the sum of covering radii,
                                                                                 ~(0~~) + r(Opa), is minimum.
any object in T(O,,).

                                                                           mHRAD This is similar to mRAD, but it minimizes the
      Split    Policies
4
                                                                              maximum of the two radii.
The “ideal” split policy should promote O,, and Or,
                                                                           MLBDIST The acronym stands for “Maximum Lower
objects, and partition other objects so that the two so-
                                                                              Bound on DISTance” . This policy differs from
obtained regions would have minimum “volume” and
                                                                              previous ones in that it only uses the pre-
minimum “overlap”. Both criteria aim to improve the
                                                                              computed stored distances. In the confirmed ver-
effectiveness of search algorithms, since having small
                                                                              sion, where O,, 5 O,, the algorithm determines
(low volume) regions leads to well-clustered trees and
                                                                              O,, as the farthest object from O,, that is
reduces the amount of indexed dead space - space
where no object is present - and having small (possi-
bly null) overlap between regions reduces the number
of paths to be traversed for answering a query.
    The minimum-volume criterion leads to devise split                     RANDOMThis variant selects in a random way the ref-
policies which try to minimize the values of the cover-                       erence object(s). Although it is not a “smart”
ing radii, whereas the minimum-overlap requirement                            strategy, it is fast and its performance can be used
suggests that, for fixed values of covering radii, the                        as a reference for other policies.
distance between chosen reference objects should be
                                                                           SAMPLING This is the RAHDOH   policy, but iterated over
maximized.4
                                                                              a sample of objects of size s > 1. For each of the
    Besides above requirements, which are quite “stan-
                                                                              s(s - 1)/2 pairs of objects in the sample, entries
dard” also for SAMs [BKSSSO], the possible high CPU
                                                                              are distributed and potential covering radii estab-
cost of computing distances should also be taken into
                                                                              lished. The pair for which the resulting maximum
account. This suggests that even naive policies (e.g.
                                                                              of the two covering radii is minimum is then se-
a random choice of routing objects), which however
                                                                              lected. In case of confirmed promotion, only s
execute few distance computations, are worth consid-
                                                                              different distributions are tried. The sample size
ering.
                                                                              in our experiments was set to l/10-th of node ca-
                                                                              pacity.
4.1    Choosing       the   Routing     Objects

The Promote method determines, given a set of en-
                                                                           4.2    Distribution    of the   Entries
tries, N, two objects to be promoted and stored into
the parent node. The specific algorithms we consider                       Given a set of entries N and two routing objects O,,
                                                                           and O,,, the problem is how to efficiently partition
can first be classified according to whether or not they
                                                                           N into two subsets, Nl and Nz. For this purpose we
“confirm” the original parent object in its role.
                                                                           consider two basic strategies. The first one is based on
              4.1 A confirmed split policy chooses as
Definition                                                                 the idea of the generalized hyperplane decomposition
one of the promoted objects, say O,, , the object O,,                      [Uhlgl] and leads to unbalanced splits, whereas the
i.e. the parent object of the split node.                                  second obtains a balanced distribution.    They can be
                                                                           shortly described as follows.
In other terms, a confirmed split policy just “extracts”
a region, centered on the second routing object, O,,,                      Generalized   Hyperplane: Assign each object Oj E
from the region which will still remain centered on 0,.                        N to the nearest routing object: if d(Oj,O,,)    5
In general, this simplifies split execution and reduces                        d(Oj, O,,) then assign Oj to Nl, else assign Oj to
the number of distance computations.
                                                                               nr,.
   The alternatives we describe for implementing
Promote are only a selected subset of the whole set                        Balanced: Compute d(Oj, O,,) and d(Oj, Or,) for all
we have experimentally evaluated.                                               Oj E N. Repeat until N is empty:

                                                                                   a Assign to Nl the nearest neighbor of O,, in
mRAD The “minimum             (sum of) RADii” algorithm is
                                                                                     N and remove it from N;
   the most complex           in terms of distance computa-
   tions. It considers       all possible pairs of objects and,                        Assign to Nz the nearest neighbor of Or, in
                                                                                   l
   after partitioning       the set of entries, promotes the                           N and remove it from N.
    4Note that, without a detailed knowledge of the distance
                                                                           Depending on data distribution and on the way how
function, it is impossible to quantify the exact amount of overlap
                                                                           routing objects are chosen, an unbalanced split policy
of two non-disjoint regions in a metric space.



                                                                     431
policy is identified by the suffix I, whereas 2 designates
can lead to a better objects’ partitioning, due to the
                                                                  a “non-confirmed” policy (see Definition 4.1).
additional degree of freedom it obtains. It has to be
                                                                     The first value in each entry pair refers to distance
remarked that, while obtaining a balanced split with
                                                                  computations (CPU cost) and the second value to page
SAMs forces the enlargement of regions along only the
                                                                  reads (I/O costs). The most important observation is
necessary dimensions, in a metric space the consequent
                                                                  that Balanced leads to a considerable CPU overhead
increase of the covering radius would propagate to all
                                                                  and also increases I/O costs. This depends on the
the “dimensions”.
                                                                  total volume covered by an M-tree - the sum of the
                                                                  volumes covered by all its routing objects - as shown
5    Experimental        Results
                                                                  by the “volume overhead” lines in the table. For in-
In this section we provide experimental results on the            stance, on 2-D data sets, using Balanced rather than
performance of M-tree in processing similarity queries.           Generalized       Hyperplane with the RANDOM-1policy
Our implementation is based on the GiST C++ pack-                 leads to an M-tree for which the covered volume is
age [HNP95], and uses a constant node size of 4                   4.60 times larger. Because of these results, in the fol-
KBytes. Although this can influence results, in that              lowing, all the split policies are based on Generalized
node capacity is inversely related to the dimensional-            Hyperplane.
ity of the data sets, we did not investigate the effect           The Effect of Dimensionality
of changing the node size.                                           We now consider how increasing the dimensionality
    We tested all the split policies described in Section         of the data set influences the performance of M-tree.
4, and evaluated them under a variety of experimen-               The number of indexed objects is IO4 in all the graphs.
tal settings. To gain the flexibility needed for com-                 Figure 3 shows that all the split policies but mRAD2
parative analysis, most experiments were baaed on                 and mMRAD-2 compute almost the same number of dis-
synthetic data sets, and here we only report about                tances for building the tree, and that this number de-
them. Data sets were obtained by using the proce-                 creases when Dim grows. The explanation is that in-
dure described in [JDBB] which generates normally-                creasing Dim reduces the node capacity, which has a
distributed clusters in a Dim-D vector space. In all              beneficial effect on the numbers of distances computed
the experiments the number of clusters is 10, the vari-           by insertion and split algorithms.       The reduction is
ance is u2 = 0.1, and clusters’ centers are uniformly             particularly evident for mRAD2 and mMRADZ, whose
distributed (Figure 2 shows a 2-D sample). Distance               CPU split costs grow as the square of node capacity.
is evaluated using the L, metric, i.e. L,(O,,O,)        =         I/O costs, shown in Figure 4, have an inverse trend
maxyjy { 1O,b] - O,b] I), which leads to hyper-cubic              and grow with space dimensionality.         This can again
search (and covering) regions. Graphs concerning con-             be explained by the reduction of node capacity. The
struction costs are obtained by averaging the costs of            fastest split policy is RANDOM2 and the slowest one is,
building 10 M-trees, and results about performance on             not surprisingly, mRADZ2.
query processing are averaged over 100 queries.




                                                                      Figure 3: Distance camp. for building                  M-tree
Figure 2: A sample data set used in the experiments
Balanced vs. Unbalanced          Split Policies
    We first compare the performance of the Balanced
and Generalized     Hyperplane implementations of the
Partition    method (see Section 4.2). Table 1 shows
the overhead of a balanced policy with respect to the
corresponding unbalanced one to process range queries
with side “Imm       (Dim = 2,10) on lo4 objects. Sim-
                                                                            Y:
ilar results were also obtained for larger dimensionali-                                 10   15   20   &   30   25   4     45     50
                                                                                 0   5

ties and for all the policies not shown here. In the ta-
                                                                          Figure 4: I/O costs for building                M-tree
ble, as well as in all other figures, a “confirmed” split


                                                            432
RANDOH-      SAMPLING-1        H-LB-DIST-1       RANDOM-2     mRAD-2
            Dim = 2                              ovh.
                                        volume                          4.60              4.38           3.90            4.07        1.69
                                               I/O ovh.
                                    did. ovh,                        2.27,2.11         1.97,1.76      1.93,1.57       2.09,1.96   1.40,1.30
                                        volume ovh.
            Dim         = 10                                            1.63              1.31           1.49            2.05        2.40
                                    dist. ovh, I/O ovh.              1.58,1.18         1.38,0.92      1.34,0.91       1.55,1.39   1.69,1.12

                    Table 1: Balanced vs. unbalanced split policies: CPU, I/O, and volume overheads
    Figure 5 shows that the “quality” of tree construc-                                   cause of the reduced page capacity, which leads to
tion, measured by the average covered volume per                                          larger trees. For the considered range of Dim values,
page, depends on split policy complexity, and that the                                    node capacity varies by a factor of 10, which almost
criterion of the cheap MLBDIST-I policy is indeed ef-                                     coincides with the ratio of I/O costs at Dim = 50 to
fective enough.                                                                           the costs at Dim = 5.
                                                                                             As to distance selectivity, differences emerge only
                                                                                          with high values of Dim, and favor mMRAD2            and
                                                                                         mRAD-2, which exhibit only a moderate performance
                                                                                         degradation. Since these two policies have the same
                                                                                         complexity, and because of above results, mRAD2 is
                                                                                         discarded in subsequent analyses.
                                                                                          Scalability
           0’
                                                                                             Another major challenge in the design of M-tree was
                0   5     10   15     20         30    36   40   45     so
                                           4:
                                                                                         to ensure scalability of performance with respect to the
     Figure 5: Average covered volume per page                                           size of the indexed data set. This addresses both as-
   Performance on lo-NN query processing, consider-                                      pects of efficiently building the tree and of performing
ing both I/O’s and distance selectivities, is shown in                                   well on similarity queries.
Figures 6 and 7, respectively - distance selectivity is                                      Table 2 shows the average number of distance com-
the ratio of computed distances to the total number of                                   putations and I/O operations per inserted object, for
objects.                                                                                 2-D data sets whose size varies in the range lo4 t 105.
                ,,,,,,,,                     _--                                         Results refer to the RANDOM2policy, but similar trends
                                                                                         were also observed for the other policies. The moder-
                                                                                         ate increase of the average number of distance compu-
                                                                                         tations depends both on the growing height of the tree
                                                                                         and on the higher density of indexed objects within
                                                                                         clusters. This is because the number of clusters was
                                                                                         kept fixed at 10, regardless of the data set size.
                                                                                             Figures 8 and 9 show that both I/O and CPU (dis-
                                                                                         tance computations) lo-NN search costs grow loga-
    Figure 6: I/O’s for processing lo-NN queries                                         rithmically with the number of objects, which demon-
         0.35 ,           ,    ,       ,    ,    ,     ,    ,    ,      ,                strates that M-tree scales well in the data set size, and
                                                                                         that the dynamic management algorithms do not de-
                                                                                         teriorate the quality of the search. It has to be empha-
                                                                                         sized that such a behavior is peculiar to the M-tree,
                                                                                         since other known metric trees are intrinsically static.


           01 5 10 15 20 a 30 35 40 45 50
                ’ 8’       8     t-1
            0           c
   Figure 7: Distance selectivity                    for lo-NN queries
   Some interesting observations can be done about
these results. First, policies based on “non-confirmed”
                                                                                                     o            2
                                                                                                                  ,,t, dcbi.cl‘(xBIW, * 10
promotion, perform better that “confirmed” policies as
to I/O costs, especially on high dimensions where they
                                                                                               Figure 8: I/O’s for processing IO-NN queries
save up to 25% I/O’s. This can be attributed to the
                                                                                            As to the relative behaviors of split policies, fig-
better object clustering that such policies can obtain.
                                                                                         ures show that “cheap” policies (e.g. RANDOM and
I/O costs increase with the dimensionality mainly be-



                                                                                 433
n. ofobjects (x104)     1      2      3        4         5     6          7           8           9           10
              avg. n. dist. camp.    45.0   49.6   53.6    57.5      61.4   65.0       68.7        72.2        73.6     74.7
              avg. n. I/O’s          8.9    9.3    9.4      9.5       9.6   9.6        9.6         9.6         9.7          9.8

 Table 2: Average number of distance computations         and I/O’s for building the M-tree (RANDOHZ split policy)




                                                                             0     5    10    16     al          30    36         a   45
                                                                                                          L%

 Figure 9: Distance computations    for lo-NN queries             Figure 11: Distance computations                    for building         M-tree
                                                                  and R*-tree
M_LBDIST-1) are penalized by the high node capac-
ity (A4 = 60) which arises when indexing 2-D points.                 Figures 12 and 13 show the search costs for square
Indeed, the higher M is, the more effective “complex”            range queries with side D’mm.        It can be observed
split policies are. This is because the number of alter-         that I/O costs for R*-tree are higher than those of all
natives for objects’ promotion grows as M2, thus for             M-tree variants. In order to present a fair compari-
high values of M the probability that cheap policies             son of CPU costs, Figure 13 also shows, for each M-
perform a good choice considerably decreases.                    tree split policy, a graph (labelled (non opt)) where
                                                                 the optimization for reducing the number of distance
                                                                 computations (see Lemma 3.2) is not applied. Graphs
5.1   Comparing     M-tree   and R’-tree
                                                                 show that this optimization is highly effective, saving
The final set of experiments we present compares M-
                                                                 up to 40% distance computations (similar results were
tree with R*-tree. The R*-tree implementation we use
                                                                 obtained for NN-queries).     Note that, even without
is the one available with the GiST package. We defi-
                                                                 such an optimization,    M-tree is almost always more
nitely do not deeply investigate merits and drawbacks
                                                                 efficient than R*-tree.
of the two structures, rather we provide some refer-
                                                                     From these results, which we remind are far from
ence results obtained from an access method which is
                                                                 providing a detailed comparison of M-trees and R*-
well-known and largely used in database systems. Fur-
                                                                 trees, we can anyway see that M-tree is a competi-
thermore, although M-tree has an intrinsically wider
                                                                 tive access method even for indexing data from vector
applicability range, we consider important to evalu-
                                                                 spaces.
ate its relative performance on “traditional” domains
where other access methods could be used as well.
    Results in Figures 10 and 11 compare I/O and
CPU costs, respectively, to build R*-tree and M-tree,
the latter only for the RANDOHZ, MLBDIST-1, and
mMRAD2 policies. The trend of the graphs for R*-tree
confirms what already observed about the influence of
the node capacity (see Figures 3 and 4). Graphs em-
phasize the different perfomance of M-trees and R*-
trees in terms of CPU costs, whereas both structures
                                                                      Figure 12: I/O’s for processing range queries
have similar I/O building costs.




           0 5 101520c 3-J 404550
                         35
                                                                    Figure 13: Distance selectivity                   for range queries
Figure 10: I/O costs for building   M-tree and W-tree


                                                           434
Conclusions
6                                                                     [B097]      T. Bozkaya and M. Ozsoyoglu.        Distance-
                                                                                  baaed indexing for high-dimensional     metric
The M-tree is an original index/storage structure with                            spaces. ACM SIGMOD, pp. 357-368, Tucson,
the following major innovative properties:                                        AZ, May 1997.
        it is a paged, balanced, and dynamic secondary                [Bri95]     S. Brin. Near neighbor search in large met-
    l
                                                                                  ric spaces. 21st VLDB, pp. 574-584, Zurich,
        memory structure able to index data sets from
                                                                                  Switzerland, Sept. 1995.
        generic metric spaces;
        similarity range and nearest neighbor queries can             [Chi94]     T. Chiueh.  Content-based     image indexing.
    l

        be performed and results ranked with respect to                           20th VLDB, pp. 582-593,       Santiago, Chile,
        a given query object;                                                     Sept. 1994.
        query execution is optimized to reduce both the
    l
                                                                                  C. Faloutsos,      W. Equitz,   M. Flickner,
                                                                      [FEF+ 941
        number of page reads and the number of distance                           W. Niblack, D. Petkovic, and R. Barber. Ef-
        computations;                                                             ficient and effective querying by image con-
        it is also suitable for high-dimensional vector data.
    l                                                                             tent. J. of Intell. In. Sys., 3(3/4):231-262,
                                                                                  July 1994.
Experimental results show that M-tree achieves its pri-
                                                                                  C. Faloutsos and K.-I. Lin.      FastMap:    A
                                                                      [FL951
mary goal, that is, dynamicity and, consequently, scal-
                                                                                  fast algorithm for indexing, data-mining and
ability with the size of data sets from generic met-
                                                                                  visualization  of traditional and multimedia
ric spaces. Analysis of many available split policies
                                                                                  datasets. ACM SIGMOD, pp. 163-174, San
suggests that the proper choice should reflect the rel-
                                                                                  Jose, CA, June 1995.
ative weights that CPU (distance computation) and
                                                                      [FRM94]     C. Faloutsos,       M.    Ranganathan,   and
I/O costs may have. This possibility of “tuning” M-
                                                                                  Y. Manolopoulos.     Fast subsequence match-
tree performance with respect to these two cost factors,
                                                                                  ing in time-series databases. ACM SIGMOD,
which are highly dependent on the specific application                            pp. 419-429, Minneapolis, MN, May 1994.
domain, has never been considered before in the anal-
                                                                      [Gut841     A. Guttman. R-trees: A dynamic index struc-
ysis of other metric trees. Our implementation, based
                                                                                  ture for spatial searching. ACM SIGMOD, pp.
on the GiST package, makes clear that M-tree can ef-                              47-57, Boston, MA, June 1984.
fectively extend the set of access methods of a database
                                                                                  J.M. Hellerstein, J.F. Naughton, and A. Pfef-
                                                                      [HNP95]
system.5                                                                          fer. Generalized search trees for database sys-
    Current and planned research work includes: sup-                              tems. 2fst VLDB, Zurich, Switzerland, Sept.
port for more complex similarity queries, use of vari-                            1995.
able size nodes to perform only “good” splits [BKKSG],                            A.K. Jain and R.C. Dubes. Algorithms         for
                                                                      [JD88]
and parallelization of CPU and I/O loads. Real-life                               Clustering Data. Prentice-Hall, 1988.
applications, such as fingerprint identification and pro-                         N. Roussopoulos, S. Kelley, and F. Vincent.
                                                                      [RKV95]
tein matching, are also considered.                                               Nearest neighbor queries. ACM SIGMOD, pp.
                                                                                  71-79, San Jose, CA, May 1995.
Acknowledgements
                                                                                  T.K. Sellis, N. Roussopoulos, and C. Falout-
                                                                      [SRF87]
This work has been funded by the EC ESPRIT LTR Proj.                              SOS.The Rt-tree: A dynamic index for multi-
9141 HERMES. P. Zezula has also been supported by                                 dimensional objects.   13th VLDB, pp. 507-
Grants GACR 102/96/0986 and KONTAKT       PM96 S028.                              518, Brighton, England, Sept. 1987.
                                                                      [Uhf911     J.K. Uhlmann.       Satisfying general proxim-
References                                                                        ity/similarity queries with metric trees. Inf.
                                                                                  Proc. Lett., 40(4):175-179, Nov. 1991.
               R. Agrawal, C. Faloutsos, and A. Swami. Effi-
[AFS93]
                                                                      [VM95]      M. Vassilakopoulos    and Y. Manolopoulos.
               cient similarity search in sequence databases.
                                                                                  Dynamic inverted quadtree: A structure for
               FODO’93, pp. 69-84, Chicago, IL, Oct. 1993.
                                                                                  pictorial databases. In. Sys., 20(6):483-500,
               Springer LNCS, Vol. 730.
                                                                                  Sept. 1995.
               S. Berchtold, D.A. Keim, and H.-P. Kriegel.
[BKK96]
                                                                      [WBKW96]    E. Wold,      T. Blum,      D. Keislar,      and
                             An index structure for high-
               The X-tree:
                                                                                  J. Wheaton.      Content-based  classification,
               dimensional data. 2&nd VLDB, pp. 28-39,
                                                                                  search, and retrieval of audio. IEEE Multi-
               Mumbai (Bombay), India, Sept. 1996.
                                                                                  media, 3(3):27-36, 1996.
               N. Beckmann, H.-P. Kriegel, R. Schneider,
[BKSSSO]
                                                                                  P. Zezula, P. Ciaccia, and F. Rabitti.      M-
                                                                      [ZCR96]
               and B. Seeger. The R.-tree: An efficient and
                                                                                  tree: A dynamic index for similarity queries in
               robust access method for points and rectan-
                                                                                  multimedia databases. TR 7, HERMES ES-
               gles. ACM SIGMOD, pp. 322-331, Atlantic
                                                                                  PRIT LTR Project, 1996. Available at URL
               City, NJ, May 1990.
                                                                                  http://uwu.ced.tuc.gr/hermes/.
    5We are now implementing M-tree in the PostgreSQL system.



                                                                435

Contenu connexe

Tendances

10.1.1.21.5598
10.1.1.21.559810.1.1.21.5598
10.1.1.21.5598azad
 
Financial Networks VI - Correlation Networks
Financial Networks VI - Correlation NetworksFinancial Networks VI - Correlation Networks
Financial Networks VI - Correlation NetworksKimmo Soramaki
 
CCNxCon2012: Session 5: CCN Location Sharing System
CCNxCon2012: Session 5: CCN Location Sharing SystemCCNxCon2012: Session 5: CCN Location Sharing System
CCNxCon2012: Session 5: CCN Location Sharing SystemPARC, a Xerox company
 
A Comparison of Computation Techniques for DNA Sequence Comparison
A Comparison of Computation Techniques for DNA Sequence Comparison A Comparison of Computation Techniques for DNA Sequence Comparison
A Comparison of Computation Techniques for DNA Sequence Comparison IJORCS
 
ADAPTING METRICS FOR MUSIC SIMILARITY USING COMPARATIVE RATINGS
ADAPTING METRICS FOR MUSIC SIMILARITY  USING COMPARATIVE RATINGS ADAPTING METRICS FOR MUSIC SIMILARITY  USING COMPARATIVE RATINGS
ADAPTING METRICS FOR MUSIC SIMILARITY USING COMPARATIVE RATINGS Daniel Wolff
 
No sql and data scalability
No sql and data scalabilityNo sql and data scalability
No sql and data scalabilityRoger Xia
 
Pioneering VDT Image Compression using Block Coding
Pioneering VDT Image Compression using Block CodingPioneering VDT Image Compression using Block Coding
Pioneering VDT Image Compression using Block CodingDR.P.S.JAGADEESH KUMAR
 
Video Denoising using Transform Domain Method
Video Denoising using Transform Domain MethodVideo Denoising using Transform Domain Method
Video Denoising using Transform Domain MethodIRJET Journal
 
Convolutional Neural Network and Feature Transformation for Distant Speech Re...
Convolutional Neural Network and Feature Transformation for Distant Speech Re...Convolutional Neural Network and Feature Transformation for Distant Speech Re...
Convolutional Neural Network and Feature Transformation for Distant Speech Re...IJECEIAES
 
Throughput optimization in
Throughput optimization inThroughput optimization in
Throughput optimization iningenioustech
 
Near-Far Resistance of MC-DS-CDMA Communication Systems
Near-Far Resistance of MC-DS-CDMA Communication SystemsNear-Far Resistance of MC-DS-CDMA Communication Systems
Near-Far Resistance of MC-DS-CDMA Communication SystemsIDES Editor
 
Experimental Applications of Mapping Services in Wireless Sensor Networks
Experimental Applications of Mapping Services in Wireless Sensor NetworksExperimental Applications of Mapping Services in Wireless Sensor Networks
Experimental Applications of Mapping Services in Wireless Sensor NetworksM H
 

Tendances (16)

10.1.1.21.5598
10.1.1.21.559810.1.1.21.5598
10.1.1.21.5598
 
Financial Networks VI - Correlation Networks
Financial Networks VI - Correlation NetworksFinancial Networks VI - Correlation Networks
Financial Networks VI - Correlation Networks
 
CCNxCon2012: Session 5: CCN Location Sharing System
CCNxCon2012: Session 5: CCN Location Sharing SystemCCNxCon2012: Session 5: CCN Location Sharing System
CCNxCon2012: Session 5: CCN Location Sharing System
 
59
5959
59
 
A Comparison of Computation Techniques for DNA Sequence Comparison
A Comparison of Computation Techniques for DNA Sequence Comparison A Comparison of Computation Techniques for DNA Sequence Comparison
A Comparison of Computation Techniques for DNA Sequence Comparison
 
50 55
50 5550 55
50 55
 
ADAPTING METRICS FOR MUSIC SIMILARITY USING COMPARATIVE RATINGS
ADAPTING METRICS FOR MUSIC SIMILARITY  USING COMPARATIVE RATINGS ADAPTING METRICS FOR MUSIC SIMILARITY  USING COMPARATIVE RATINGS
ADAPTING METRICS FOR MUSIC SIMILARITY USING COMPARATIVE RATINGS
 
Ec36783787
Ec36783787Ec36783787
Ec36783787
 
No sql and data scalability
No sql and data scalabilityNo sql and data scalability
No sql and data scalability
 
Pioneering VDT Image Compression using Block Coding
Pioneering VDT Image Compression using Block CodingPioneering VDT Image Compression using Block Coding
Pioneering VDT Image Compression using Block Coding
 
Video Denoising using Transform Domain Method
Video Denoising using Transform Domain MethodVideo Denoising using Transform Domain Method
Video Denoising using Transform Domain Method
 
Convolutional Neural Network and Feature Transformation for Distant Speech Re...
Convolutional Neural Network and Feature Transformation for Distant Speech Re...Convolutional Neural Network and Feature Transformation for Distant Speech Re...
Convolutional Neural Network and Feature Transformation for Distant Speech Re...
 
Throughput optimization in
Throughput optimization inThroughput optimization in
Throughput optimization in
 
Near-Far Resistance of MC-DS-CDMA Communication Systems
Near-Far Resistance of MC-DS-CDMA Communication SystemsNear-Far Resistance of MC-DS-CDMA Communication Systems
Near-Far Resistance of MC-DS-CDMA Communication Systems
 
Resource Oriented Architecture for Managing Multimedia Content by Florian
Resource Oriented Architecture for Managing Multimedia Content by FlorianResource Oriented Architecture for Managing Multimedia Content by Florian
Resource Oriented Architecture for Managing Multimedia Content by Florian
 
Experimental Applications of Mapping Services in Wireless Sensor Networks
Experimental Applications of Mapping Services in Wireless Sensor NetworksExperimental Applications of Mapping Services in Wireless Sensor Networks
Experimental Applications of Mapping Services in Wireless Sensor Networks
 

Similaire à m tree

Massive Data Analytics and the Cloud
Massive Data Analytics and the CloudMassive Data Analytics and the Cloud
Massive Data Analytics and the CloudBooz Allen Hamilton
 
A Relational Model of Data for Large Shared Data Banks
A Relational Model of Data for Large Shared Data BanksA Relational Model of Data for Large Shared Data Banks
A Relational Model of Data for Large Shared Data Banksrenguzi
 
Birthof Relation Database
Birthof Relation DatabaseBirthof Relation Database
Birthof Relation DatabaseRaj Bhat
 
Ontology Mapping for Dynamic Multiagent Environment
Ontology Mapping for Dynamic Multiagent Environment Ontology Mapping for Dynamic Multiagent Environment
Ontology Mapping for Dynamic Multiagent Environment IJORCS
 
Synthesis of Non-Replicated Dynamic Fragment Allocation Algorithm in Distribu...
Synthesis of Non-Replicated Dynamic Fragment Allocation Algorithm in Distribu...Synthesis of Non-Replicated Dynamic Fragment Allocation Algorithm in Distribu...
Synthesis of Non-Replicated Dynamic Fragment Allocation Algorithm in Distribu...IDES Editor
 
Cassandra framework a service oriented distributed multimedia
Cassandra framework  a service oriented distributed multimediaCassandra framework  a service oriented distributed multimedia
Cassandra framework a service oriented distributed multimediaJoão Gabriel Lima
 
Assessing no sql databases for telecom applications
Assessing no sql databases for telecom applicationsAssessing no sql databases for telecom applications
Assessing no sql databases for telecom applicationsJoão Gabriel Lima
 
10.1.1.21.5883
10.1.1.21.588310.1.1.21.5883
10.1.1.21.5883paserv
 
A survey of peer-to-peer content distribution technologies
A survey of peer-to-peer content distribution technologiesA survey of peer-to-peer content distribution technologies
A survey of peer-to-peer content distribution technologiessharefish
 
Java Abs Peer To Peer Design & Implementation Of A Tuple Space
Java Abs   Peer To Peer Design & Implementation Of A Tuple SpaceJava Abs   Peer To Peer Design & Implementation Of A Tuple Space
Java Abs Peer To Peer Design & Implementation Of A Tuple Spacencct
 
Java Abs Peer To Peer Design & Implementation Of A Tuple S
Java Abs   Peer To Peer Design & Implementation Of A Tuple SJava Abs   Peer To Peer Design & Implementation Of A Tuple S
Java Abs Peer To Peer Design & Implementation Of A Tuple Sncct
 
Dynamic Resource Provisioning with Authentication in Distributed Database
Dynamic Resource Provisioning with Authentication in Distributed DatabaseDynamic Resource Provisioning with Authentication in Distributed Database
Dynamic Resource Provisioning with Authentication in Distributed DatabaseEditor IJCATR
 
SOFIA - Experiences in Implementing a Cross-domain Use Case by Combining Sema...
SOFIA - Experiences in Implementing a Cross-domain Use Case by Combining Sema...SOFIA - Experiences in Implementing a Cross-domain Use Case by Combining Sema...
SOFIA - Experiences in Implementing a Cross-domain Use Case by Combining Sema...Sofia Eu
 
A Scalable, Commodity Data Center Network Architecture
A Scalable, Commodity Data Center Network ArchitectureA Scalable, Commodity Data Center Network Architecture
A Scalable, Commodity Data Center Network ArchitectureHiroshi Ono
 

Similaire à m tree (20)

Massive Data Analytics and the Cloud
Massive Data Analytics and the CloudMassive Data Analytics and the Cloud
Massive Data Analytics and the Cloud
 
A Relational Model of Data for Large Shared Data Banks
A Relational Model of Data for Large Shared Data BanksA Relational Model of Data for Large Shared Data Banks
A Relational Model of Data for Large Shared Data Banks
 
Birthof Relation Database
Birthof Relation DatabaseBirthof Relation Database
Birthof Relation Database
 
Ontology Mapping for Dynamic Multiagent Environment
Ontology Mapping for Dynamic Multiagent Environment Ontology Mapping for Dynamic Multiagent Environment
Ontology Mapping for Dynamic Multiagent Environment
 
Synthesis of Non-Replicated Dynamic Fragment Allocation Algorithm in Distribu...
Synthesis of Non-Replicated Dynamic Fragment Allocation Algorithm in Distribu...Synthesis of Non-Replicated Dynamic Fragment Allocation Algorithm in Distribu...
Synthesis of Non-Replicated Dynamic Fragment Allocation Algorithm in Distribu...
 
Cassandra framework a service oriented distributed multimedia
Cassandra framework  a service oriented distributed multimediaCassandra framework  a service oriented distributed multimedia
Cassandra framework a service oriented distributed multimedia
 
Data-Intensive Research
Data-Intensive ResearchData-Intensive Research
Data-Intensive Research
 
Assessing no sql databases for telecom applications
Assessing no sql databases for telecom applicationsAssessing no sql databases for telecom applications
Assessing no sql databases for telecom applications
 
10.1.1.21.5883
10.1.1.21.588310.1.1.21.5883
10.1.1.21.5883
 
Amazon SimpleDB
Amazon SimpleDBAmazon SimpleDB
Amazon SimpleDB
 
Replica
ReplicaReplica
Replica
 
A survey of peer-to-peer content distribution technologies
A survey of peer-to-peer content distribution technologiesA survey of peer-to-peer content distribution technologies
A survey of peer-to-peer content distribution technologies
 
Java Abs Peer To Peer Design & Implementation Of A Tuple Space
Java Abs   Peer To Peer Design & Implementation Of A Tuple SpaceJava Abs   Peer To Peer Design & Implementation Of A Tuple Space
Java Abs Peer To Peer Design & Implementation Of A Tuple Space
 
Java Abs Peer To Peer Design & Implementation Of A Tuple S
Java Abs   Peer To Peer Design & Implementation Of A Tuple SJava Abs   Peer To Peer Design & Implementation Of A Tuple S
Java Abs Peer To Peer Design & Implementation Of A Tuple S
 
Dynamic Resource Provisioning with Authentication in Distributed Database
Dynamic Resource Provisioning with Authentication in Distributed DatabaseDynamic Resource Provisioning with Authentication in Distributed Database
Dynamic Resource Provisioning with Authentication in Distributed Database
 
Sensor Data Management
Sensor Data ManagementSensor Data Management
Sensor Data Management
 
Hadoop at JavaZone 2010
Hadoop at JavaZone 2010Hadoop at JavaZone 2010
Hadoop at JavaZone 2010
 
P57 Novelli
P57 NovelliP57 Novelli
P57 Novelli
 
SOFIA - Experiences in Implementing a Cross-domain Use Case by Combining Sema...
SOFIA - Experiences in Implementing a Cross-domain Use Case by Combining Sema...SOFIA - Experiences in Implementing a Cross-domain Use Case by Combining Sema...
SOFIA - Experiences in Implementing a Cross-domain Use Case by Combining Sema...
 
A Scalable, Commodity Data Center Network Architecture
A Scalable, Commodity Data Center Network ArchitectureA Scalable, Commodity Data Center Network Architecture
A Scalable, Commodity Data Center Network Architecture
 

Plus de Sriram Raj

Data Structures Aptitude
Data Structures AptitudeData Structures Aptitude
Data Structures AptitudeSriram Raj
 
Trees Information
Trees InformationTrees Information
Trees InformationSriram Raj
 
C Languages FAQ's
C Languages FAQ'sC Languages FAQ's
C Languages FAQ'sSriram Raj
 
Beej Guide Network Programming
Beej Guide Network ProgrammingBeej Guide Network Programming
Beej Guide Network ProgrammingSriram Raj
 
Html For Beginners 2
Html For Beginners 2Html For Beginners 2
Html For Beginners 2Sriram Raj
 
Cracking The Interview
Cracking The InterviewCracking The Interview
Cracking The InterviewSriram Raj
 
Queue Data Structure
Queue Data StructureQueue Data Structure
Queue Data StructureSriram Raj
 
Html for Beginners
Html for BeginnersHtml for Beginners
Html for BeginnersSriram Raj
 
Deletion From A Bst
Deletion From A BstDeletion From A Bst
Deletion From A BstSriram Raj
 
Prims Algorithm
Prims AlgorithmPrims Algorithm
Prims AlgorithmSriram Raj
 
Linked List Problems
Linked List ProblemsLinked List Problems
Linked List ProblemsSriram Raj
 
Assignment Examples Final 07 Oct
Assignment Examples Final 07 OctAssignment Examples Final 07 Oct
Assignment Examples Final 07 OctSriram Raj
 

Plus de Sriram Raj (20)

Interviewtcs
InterviewtcsInterviewtcs
Interviewtcs
 
Pointers In C
Pointers In CPointers In C
Pointers In C
 
M tree
M treeM tree
M tree
 
Data Structures Aptitude
Data Structures AptitudeData Structures Aptitude
Data Structures Aptitude
 
Trees Information
Trees InformationTrees Information
Trees Information
 
C Languages FAQ's
C Languages FAQ'sC Languages FAQ's
C Languages FAQ's
 
Beej Guide Network Programming
Beej Guide Network ProgrammingBeej Guide Network Programming
Beej Guide Network Programming
 
Pointers In C
Pointers In CPointers In C
Pointers In C
 
Rfc768
Rfc768Rfc768
Rfc768
 
Sctp
SctpSctp
Sctp
 
Html For Beginners 2
Html For Beginners 2Html For Beginners 2
Html For Beginners 2
 
Cracking The Interview
Cracking The InterviewCracking The Interview
Cracking The Interview
 
Queue Data Structure
Queue Data StructureQueue Data Structure
Queue Data Structure
 
Html for Beginners
Html for BeginnersHtml for Beginners
Html for Beginners
 
Hash Table
Hash TableHash Table
Hash Table
 
Deletion From A Bst
Deletion From A BstDeletion From A Bst
Deletion From A Bst
 
Binary Trees
Binary TreesBinary Trees
Binary Trees
 
Prims Algorithm
Prims AlgorithmPrims Algorithm
Prims Algorithm
 
Linked List Problems
Linked List ProblemsLinked List Problems
Linked List Problems
 
Assignment Examples Final 07 Oct
Assignment Examples Final 07 OctAssignment Examples Final 07 Oct
Assignment Examples Final 07 Oct
 

Dernier

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 

Dernier (20)

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 

m tree

  • 1. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces Marco Patella Pa010 Ciaccia Pave1 Zezula DEIS - CSITE-CNR DEIS - CSITE-CNR CNUCE-CNR Bologna, Italy Piss, Italy Bologna, Italy pciaccia@deis.unibo.it mpatella@deis . unibo . it zezula@iei.pi.cnr.it increased and resulted in the development of multi- media database systems aiming at a uniform manage- Abstract ment of voice, video, image, text, and numerical data. Among the many research challenges which the multi- A new access method, called M-tree, is pro- media technology entails - including data placement, posed to organize and search large data sets presentation, synchronization, etc. - content-based re- trieval plays a dominant role. In order to satisfy the from a generic “metric space”, i.e. where ob- information needs of users, it is of vital importance to ject proximity is only defined by a distance effectively and efficiently support the retrieval process function satisfying the positivity, symmetry, devised to determine which portions of the database and triangle inequality postulates. We detail are relevant to users’ requests. algorithms for insertion of objects and split management, which keep the M-tree always In particular, there is an urgent need of index- balanced - several heuristic split alternatives ing techniques able to support execution of similarity are considered and experimentally evaluated. queries. Since multimedia applications typically re- Algorithms for similarity (range and k-nearest quire complex distance functions to quantify similari- Re- neighbors) queries are also described. ties of multi-dimensional features, such as shape, tex- sults from extensive experimentation with a ture, color [FEF+94], image patterns [VM95], sound prototype system are reported, considering as [WBKW96], text, fuzzy values, set values [HNP95], se- the performance criteria the number of page quence data [AFS93, FRM94], etc., multi-dimensional I/O’s and the number of distance computa- (spatial) access methods (SAMs), such as R-tree tions. The results demonstrate that the M- [Gut841 and its variants [SRF87, BKSSSO], have been tree indeed extends the domain of applica- considered to index such data. However, the applica- bility beyond the traditional vector spaces, bility of SAMs is limited by the following assumptions performs reasonably well in high-dimensional which such structures rely on: data spaces, and scales well in case of growing 1. objects are, for indexing purposes, to be repre- files. sented by means of feature values in a multi- dimensional vector space; 1 Introduction 2. the (dis)similarity of any two objects has to be Recently, the need to manage various types of data based on a distance function which does not in- stored in large computer repositories has drastically troduce any correlation (or “cross-talk”) between feature values [FEF+94]. More precisely, an L, to copy without fee all or part of this material Permission is metric, such as the Euclidean distance, has to be granted provided that the copies are not made or distributed for direct commercial advantage, copyright notice and the VLDB used. the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Furthermore, from a performance point of view, SAMs To copy otherwise, requires a fee Endowment. or to republish, assume that comparison of keys (feature values) is a and/or special permission from the Endowment. trivial operation with respect to the cost of accessing a Proceedings of the 23rd VLDB Conference disk page, which is not always the case in multimedia Athens, Greece, 1997 426
  • 2. applications. Consequently, no attempt in the design cific metric distance function d. of these structures has been done to reduce the number Formally, a metric space is a pair, M = (D,d), of distance computations. where V is a domain of feature values - the indexing keys - and d is a total (distance) function with the A more general approach to the “similarity index- ing” problem has gained some popularity in recent following properties:l years, leading to the development of so-called metric (symmetry) 1. d(O,, 0,) = d(C),, 0,) trees (see [UhlSl]). M et ric trees only consider relative distances of objects (rather than their absolute posi- 2. d(O,, 0,) > 0 (0, # 0,) and d(O,, 0,) =0 tions in a multi-dimensional space) to organize and (non negativity) partition the search space, and just require that the 3. d(O,, 0,) 5 4% 0,) + d(Oz,O,) function used to measure the distance (dissimilarity) (triangle inequality) between objects is a metric (see Section 2), so that the triangle inequality property applies and can be used to In principle, there are two basic types of similarity prune the search space. queries: the range query and the k nearest neighbors Although the effectiveness of metric trees has been query. clearly demonstrated [Chi94, Bri95, B097], current de- Definition 2.1 (Range) signs suffer from being intrinsically static, which limits Given a query object Q E 2) and a maximum search their applicability in dynamic database environments. distance r(Q), the range query range(Q, r(Q)) selects Contrary to SAMs, known metric trees have only tried all indexed objects Oj such that d(Oj, Q) 5 r(Q). q to reduce the number of distance computations re- quired to answer a query, paying no attention to I/O Definition 2.2 (k nearest neighbors (LNN)) costs. Given a query object Q E 2) and an integer k 2 1, In this article, we introduce a paged metric tree, the k-NN query NN(&, k) selects the k indexed objects called M-tree, which has been explicitly designed to 0 which have the shortest distance from Q. be integrated with other access methods in database systems. We demonstrate such possibility by imple- There have already been some attempts to tackle the menting M-tree in the GiST (Generalized Search Tree) difficult metric space indexing problem. The FastMap [HNP95] framework, which allows specific access meth- algorithm [FL951 transforms a matrix of pairwise dis- ods to be added to an extensible database system. tances into a set of low-dimensional points, which can then be indexed by a SAM. However, FastMap as- The M-tree is a balanced tree, able to deal with sumes a static data set and introduces approximation dynamic data files, and as such it does not require errors in the mapping process. The Vantage Point periodical reorganizations. M-tree can index objects (VP) tree [Chi94] partitions a data set according to using features compared by distance functions which distances the objects have with respect to a reference either do not fit into a vector space or do not use an L, (vantage) point. The median value of such distances metric, thus considerably extends the cases for which is used as a separator to partition objects into two efficient query processing is possible. Since the design balanced subsets, to which the same procedure can re- of M-tree is inspired by both principles of metric trees cursively be applied. The MVP-tree [B097] extends and database access methods, performance optimiza- this idea by using multiple vantage points, and exploits tion concerns both CPU (distance computations) and pre-computed distances to reduce the number of dis- I/O costs. tance computations at query time. The GNAT de- After providing some preliminary background in sign [Bri95] applies a different - so-called generalized Section 2, Section 3 introduces the basic M-tree prin- [Uhlgl] - partitioning style. In the basic hyperplane ciples and algorithms. In Section 4, we discuss the case, two reference objects are chosen and each of the available alternatives for implementing the split strat- remaining objects is assigned to the closest reference egy used to manage node overflows. Section 5 presents object. So obtained subsets can recursively be split experimental results. Section 6 concludes and suggests again, when necessary. topics for future research activity. Since all of the above organizations build trees by means of a top-down recursive process, these trees are 2 Preliminaries not guaranteed to remain balanced in case of inser- tions and deletions, and require costly reorganizations Indexing a metric space means to provide an efficient support for answering similarity queries, i.e. queries to prevent performance degradation. whose purpose is to retrieve DB objects which are ‘In order to simplify the presentation, we sometimes refer to “similar” to a reference (query) object, and where the 0, as an object, rather than as the feature value of the object (dis)similarity between objects is measured by a spe- itself. 427
  • 3. 3 The M-tree file.3 In summary, entries in leaf nodes are structured as follows. The research challenge which has led to the design of (feature value of the) DB object M-tree was to combine advantages of balanced and Oj object identifier oid(Oj) dynamic SAMs, with the capabilities of static metric distance of Oj from its parent d(Oj, P(Oj)) trees to index objects using features and distance func- tions which do not fit into a vector space and are only 3.2 Processing Similarity Queries constrained by the metric postulates. The M-tree partitions objects on the basis of their Before presenting specific algorithms for building the relative distances, as measured by a specific distance M-tree, we show how the information stored in nodes function d, and stores these objects into fixed-size is used for processing similarity queries. Although per- nodes,2 which correspond to constrained regions of the formance of search algorithms is largely influenced by metric space. The M-tree is fully parametric on the the actual construction of the M-tree , the correctness distance function d, so the function implementation is and the logic of search are independent of such aspects. a black-box for the M-tree. The theoretical and appli- In both the algorithms we present, the objective is cation background of M-tree is thoroughly described to reduce, besides the number of accessed nodes, also in [ZCRSG]. In this article, we mainly concentrate on the number of distance computations needed to exe- implementation issues in order to establish some basis cute queries. This is particularly relevant when the for performance evaluation and comparison, search turns out to be CPU- rather than I/O-bound, which might be the case for computationally intensive distance functions. For this purpose, all the informa- 3.1 The Structure of M-tree Nodes tion concerning (pre-computed) distances stored in the Leaf nodes of any M-tree store all indexed (database) M-tree nodes, i.e. d(O;, P(Oi)) and r(Oi), is used to objects, represented by their keys or features, whereas effectively apply the triangle inequality. internal nodes store the so-called routing objects. A routing object is a database object to which a routing 3.2.1 Range Queries role is assigned by a specific promotion algorithm (see The query range(&, r(Q)) selects all the DB objects Section 4). such that d(Oj,Q) < r(Q). Algorithm RS starts from For each routing object 0, there is an associated the root node and recursively traverses all the paths pointer, denoted ptr(T(O,)), which references the root which cannot be excluded from leading to objects sat- of a sub-tree, T(O,), called the cowering tree of 0,. isfying the above inequality. All objects in the covering tree of 0, are within the Q:query-object, r(Q):search~adius) RS(N:node, distance r(0,) from O,., r(O,.) > 0, which is called the { let 0, be the parent object of node N; covering radius of 0, and forms a part of the 0, entry if N is not a leaf in a particular M-tree node. Finally, a routing object then { V 0, in N do: 0, is associated with a distance to P(O,), its parent if Id(O,, Q) - d(O,, 0,) ( 5 r(Q) + r(0,) object, that is the routing object which references the then { Compute d(O,, Q); node where the 0, entry is stored. Obviously, this if 4% 9) I r(Q) + r(Or) distance is not defined for entries in the root of the then RS(*ptr(T(O,)),Q,r(Q)); }} { V 0, in N do: else M-tree. The general information for a routing object if Id(O,,Q)-d(O,,Or)l<r(Q) entry is summarized in the following table. then { Compute d(0, , Q) ; if d(O,, Q) I r(Q) then add oid(Oj) to the result ; }}} (feature value of the) routing object Since, when accessing node N, the distance between ~~ Q and O,, the parent object of N, has already been computed, it is possible to prune a sub-tree without any new distance at all. The condition ap- computing An entry for a database object Oj in a leaf is quite sim- plied for pruning is as follows. ilar to that of a routing object, but no covering radius is needed, and the pointer field stores the actual ob- 3.1 If d(Op,Q) Lemma > r(Q) + r(O,.), then, for ject identifier (oid), which is used to provide access to each object Oj in T(O,), it is d(Oj,Q) > r(Q). Thus, the whole object possibly resident on a separate data can be safely pruned from the search. T(0,) 30f course, the M-tree can also be used as a primary data 2Nothing would prevent using variable-size nodes, as it is organization, where the whole objects are stored in the leaves of done in the X-tree [BKK96]. For simplicity, however, we do not the tree. consider this possibility here. 428
  • 4. and its current k-th nearest neighbor - the order in which nodes are visited can affect performance. The heuristic criterion implemented by the ChooseNode function is to select the node for which the d,,,i,, lower bound is minimum. According to our experimental observations, other criteria do not lead to better per- formance. b) a) ChooseNode(PR:priority-queue) : node { let Lin(T(O~)) = min{Lin(T(Or))}. Figure 1: Lemma 3.2 applied to avoid distance com- considering all the entries in PR; putations Remove entry Cptr(T(0:)) ,d,i,(T(Oz))l from PR; In fact, since d(Oj, Q) 2 d(O,, Q)-d(Oj, 0,) (triangle return *ptr(T(O:)); } inequality) and d(Oj,Or) < r(0,) (def. of covering At the end of execution, the i-th entry of the NN radius), it is d(Oj,Q) 2 d(O,,&) - r(0,). Since, by array will have value NNCil = Loid ,d(Oj, Q)l, hypothesis, it is d(Or,Q) - r(0,) > r(Q), the result with Oj being the i-th nearest neighbor of Q. The follows. distance value in the i-th entry is denoted as d;, so In order to apply Lemma 3.1, the d(O,, Q) distance that dk is the largest distance value in NN. Clearly, dk has to be computed. This can be avoided by taking plays the role of a dynamic search radius, since any advantage of the following result. sub-tree for which dmin(T(Or)) > dk can be safely 3.2 IfId(O,,Q)-d(Or,O,)l>r(Q)+r(O~), Lemma pruned. then d(O,, Q) > r(Q) + r(O,). Entries of the NR array are initially set to NNCil = C-,001 (i= l,..., k), i.e. oid’s are undefined This is a direct consequence of the triangle inequality, and di = 00. As the search starts and (internal) nodes which guarantees that both d(Or, Q) 2 d(O,, Q) - are accessed, the idea is to compute, for each sub-tree (Figure 1 a) and d(O,, Q) 2 d(Or,O,) - d(O,,O,) T(O,), an upper bound, d,,,(T(O,)), on the distance d(O,, Q) (Figure 1 b) hold. The same optimization of any object in T(0,) from Q. The upper bound is principle is applied to leaf nodes as well. Experimen- set to tal results (see Section 5) show that this technique dmoz(T(Or)) = d(O,, Q) + r(O,) can save up to 40% distance computations. The only case where distances are necessary to compute is when Consider the simplest case k = 1, two sub-trees, dealing with the root node, for which 0, is undefined. f(jA$Land T(O,,), and amme that d,,,(T(O,,)) = min(T(Ora)) = 7. Since d,,,(T(Op,)) guaran- tees that an object whose distance from Q is at most 5 exists in T(O,,), T(O,,) can be pruned from the 3.2.2 Nearest Neighbor Queries search. The d,,, bounds are inserted in appropriate The k-NNSearch algorithm retrieves the k nearest positions in the NN array, just leaving the oid field un- neighbors of a query object Q - it is assumed that defined. The k-NNSearch algorithm is given below. at least k objects are indexed by the M-tree. We k-NNSearch(T:rootnode,Q:query-object,k:integer) use a branch-and-bound technique, quite similar to { PR = CT,-]; the one designed for R-trees [RKV95], which utilizes for i = 1 to k do: NN[i] = [-,001; two global structures: a priority queue, PR, and a k- while PR #a do: elements array, NN, which, at the end of execution, { Nextlode = ChooseNode( contains the result. k-NNHodeSearch(NextHode,Q, k); }} PR is a queue of pointers to active sub-trees, i.e. sub trees where qualifying objects can be found. With the The $-NNNodeSearch method implements most of pointer to (the root of) sub-tree T(O,), a lower bound, the search logic. On an internal node, it first deter- on the distance of any object in T(0,) mines active sub-trees and inserts them into the PR d,i”(T(O,)), queue. Then, if needed, it calls the RN-Update function from Q is also kept. The lower bound is (not specified here) t o p er f orm an ordered insertion in Q) - r(G), dmin(T(Or)) = max{d(O,, 0) the RN array and receives back a (possibly new) value of dk. This is then used to remove from PR all sub-trees since no object in T(0,) can have a distance from Q for which the dmin lower bound exceeds dk. Similar ac- less than d(O,, Q) - r(0,). These bounds are used by tions are performed in leaf nodes. In both cases, the the ChooseNode function to extract from PR the next optimization to reduce the number of distance compu- node to be examined. tations by means of the pre-computed distances from Since the pruning criterion of k-NNSearch is dy- the parent object, is also applied. namic - the search radius is the distance between Q 429
  • 5. The determination of the set Ni,, - routing objects k-NNHodeSearch(N:node,Q:query-object ,k: integer) { let 0, be the parent object of node N; for which no enlargement of the covering radius is if N is not a leaf then needed - can be optimized by saving distance com- { V 0, in N do: putations. From Lemma 3.2, by substituting 0, for & if Id(O,,Q) -d(O,,O,)) 5 dk +r(O,) then and setting r(0,) z r(Q) = 0, we derive that: { Compute d(O,,Q); if dmi,(T(Or)) 5 dk then If Id(O,, O,)-d(O,, Op)(> r(0,) then d(O,, 0,) > (0,) { add Cptr(T(O,)),d,i,(T(O,))l to PR; from which it follows that 0,. # Ni,,. Note that this if dmaz(T(Or)) < dk then optimization cannot be applied in the root node. { dk = NN-Update(C-,d,,,(T(O,))l) ; Remove from PR all entries 3.4 Split Management for which dmin(T(Or)) > dk; }}}} else /* N is a leaf */ As any other dynamic balanced tree, M-tree grows in { VO, inNdo: a bottom-up fashion. The overflow of a node N is if (d(O,, Q) - d(O,, 0,) 1< dk then managed by allocating a new node, N’, at the same {Compute d(O,, Q); level of N, partitioning the entries among these two if d(O,, Q) 5 dk then nodes, and posting (promoting) to the parent node, { dk = NNJJpdate([oid(O,),d(Oj,Q)l) ; Np, two routing objects to reference the two nodes. Remove from PR all entries When the root splits, a new root is created and the for which dmin(T(Or)) > dk; }}}} M-tree grows by one level up. Split(N:node; E:H-tree-entry) 3.3 Building the M-tree { let n/ = entries of node N U {E}; Algorithms for building the M-tree specify how objects ifN is not the root then are inserted and deleted, and how node overflows and let 0, be the parent of N. stored in Np node; underflows are managed.. Due to space limitations, a nev node N’; Allocate Promote (JV , O,, , Op2) ; deletion of objects is not described in this article. Partition(hf,O,, ,Opa ,n/l ,&I ; The Insert algorithm recursively descends the M- Store Nl’s entries in N and Nz’s entries in N’; tree to locate the most suitable leaf node for accommo- if N is the current root dating a new object, 0,, possibly triggering a split if then { Allocate a new root node, Np; the leaf is full. The basic rationale used to determine Store entry(O,,) and entryCOp,) in Np; } the “most suitable” leaf node is to descend, at each else { Replace entry with entry(O,,) in N,,; level of the tree, along a sub-tree, T(O,), for which if node Np is full no enlargement of the covering radius is needed, i.e. then Split (N,, entryto,,)) d(O,,O,) 5 r(0,). If multiple sub-trees with this in N,,; }} else store entry(Op,) property exist, the one for which object 0, is clos- The Promote method chooses, according to some est to 0, is chosen. This heuristics tries to obtain specific criterion, two routing objects, O,, and O,, , to well-clustered sub-trees, which haa a beneficial effect be inserted into the parent node, Np. The Partition on performance. method divides entries of the overflown node (the N If no routing object for which d(O,,O,) 5 ~(0,) set) into two disjoint subsets, Nl and Nz, which are exists, the choice is to minimize the increase of the then stored in nodes N and N’, respectively. A specific covering radius, d(O,, 0,) - r(0,). This is tightly re- implementation of the Promote and Partition meth- lated to the heuristic criterion that suggests to mini- mize the overall “volume” covered by routing objects ods defines what we call a split policy . Unlike other in the current node. (static) metric tree designs, each relying on a specific Insert (N :node, entry (0,) :H-tree-entry) criterion to organize objects, M-tree offers a possibil- { let, N’ be the set of entries in node N; ity of implementing alternative split policies, in order if N is not a leaf then to tune performance depending on specific application { let AC;, = entries such that d(O,,O,) 5 r(0,); needs (see Section 4). if AL # 0 Regardless of the specific split policy, the semantics then let entry E n/,, :d(O:, 0,) is minimum; of covering radii has to be preserved. If the split node else { let entry(O,‘) E N: is a leaf, then the covering radius of a promoted object, d(Or , 0,) - ~(0:) is minimum; say Opl, is set to let r(O:) = d(O:,O,); } Insert (*ptr (T(O:)) , entry (0,) ) ; } r(Opl) = ma4d(Oj, O,,)lO~ E A41 else /* N is a leaf */ { if N is not full whereas if overflow occurs in an internal node then store entry in N else Split(N,entry(O,)) ; }} r(Opl) = max{d(O,, O,,) + r(Or)IOr E NI) 430
  • 6. which guarantees that d(Oj,O,,) 5 ~(0~~) holds for pair of objects for which the sum of covering radii, ~(0~~) + r(Opa), is minimum. any object in T(O,,). mHRAD This is similar to mRAD, but it minimizes the Split Policies 4 maximum of the two radii. The “ideal” split policy should promote O,, and Or, MLBDIST The acronym stands for “Maximum Lower objects, and partition other objects so that the two so- Bound on DISTance” . This policy differs from obtained regions would have minimum “volume” and previous ones in that it only uses the pre- minimum “overlap”. Both criteria aim to improve the computed stored distances. In the confirmed ver- effectiveness of search algorithms, since having small sion, where O,, 5 O,, the algorithm determines (low volume) regions leads to well-clustered trees and O,, as the farthest object from O,, that is reduces the amount of indexed dead space - space where no object is present - and having small (possi- bly null) overlap between regions reduces the number of paths to be traversed for answering a query. The minimum-volume criterion leads to devise split RANDOMThis variant selects in a random way the ref- policies which try to minimize the values of the cover- erence object(s). Although it is not a “smart” ing radii, whereas the minimum-overlap requirement strategy, it is fast and its performance can be used suggests that, for fixed values of covering radii, the as a reference for other policies. distance between chosen reference objects should be SAMPLING This is the RAHDOH policy, but iterated over maximized.4 a sample of objects of size s > 1. For each of the Besides above requirements, which are quite “stan- s(s - 1)/2 pairs of objects in the sample, entries dard” also for SAMs [BKSSSO], the possible high CPU are distributed and potential covering radii estab- cost of computing distances should also be taken into lished. The pair for which the resulting maximum account. This suggests that even naive policies (e.g. of the two covering radii is minimum is then se- a random choice of routing objects), which however lected. In case of confirmed promotion, only s execute few distance computations, are worth consid- different distributions are tried. The sample size ering. in our experiments was set to l/10-th of node ca- pacity. 4.1 Choosing the Routing Objects The Promote method determines, given a set of en- 4.2 Distribution of the Entries tries, N, two objects to be promoted and stored into the parent node. The specific algorithms we consider Given a set of entries N and two routing objects O,, and O,,, the problem is how to efficiently partition can first be classified according to whether or not they N into two subsets, Nl and Nz. For this purpose we “confirm” the original parent object in its role. consider two basic strategies. The first one is based on 4.1 A confirmed split policy chooses as Definition the idea of the generalized hyperplane decomposition one of the promoted objects, say O,, , the object O,, [Uhlgl] and leads to unbalanced splits, whereas the i.e. the parent object of the split node. second obtains a balanced distribution. They can be shortly described as follows. In other terms, a confirmed split policy just “extracts” a region, centered on the second routing object, O,,, Generalized Hyperplane: Assign each object Oj E from the region which will still remain centered on 0,. N to the nearest routing object: if d(Oj,O,,) 5 In general, this simplifies split execution and reduces d(Oj, O,,) then assign Oj to Nl, else assign Oj to the number of distance computations. nr,. The alternatives we describe for implementing Promote are only a selected subset of the whole set Balanced: Compute d(Oj, O,,) and d(Oj, Or,) for all we have experimentally evaluated. Oj E N. Repeat until N is empty: a Assign to Nl the nearest neighbor of O,, in mRAD The “minimum (sum of) RADii” algorithm is N and remove it from N; the most complex in terms of distance computa- tions. It considers all possible pairs of objects and, Assign to Nz the nearest neighbor of Or, in l after partitioning the set of entries, promotes the N and remove it from N. 4Note that, without a detailed knowledge of the distance Depending on data distribution and on the way how function, it is impossible to quantify the exact amount of overlap routing objects are chosen, an unbalanced split policy of two non-disjoint regions in a metric space. 431
  • 7. policy is identified by the suffix I, whereas 2 designates can lead to a better objects’ partitioning, due to the a “non-confirmed” policy (see Definition 4.1). additional degree of freedom it obtains. It has to be The first value in each entry pair refers to distance remarked that, while obtaining a balanced split with computations (CPU cost) and the second value to page SAMs forces the enlargement of regions along only the reads (I/O costs). The most important observation is necessary dimensions, in a metric space the consequent that Balanced leads to a considerable CPU overhead increase of the covering radius would propagate to all and also increases I/O costs. This depends on the the “dimensions”. total volume covered by an M-tree - the sum of the volumes covered by all its routing objects - as shown 5 Experimental Results by the “volume overhead” lines in the table. For in- In this section we provide experimental results on the stance, on 2-D data sets, using Balanced rather than performance of M-tree in processing similarity queries. Generalized Hyperplane with the RANDOM-1policy Our implementation is based on the GiST C++ pack- leads to an M-tree for which the covered volume is age [HNP95], and uses a constant node size of 4 4.60 times larger. Because of these results, in the fol- KBytes. Although this can influence results, in that lowing, all the split policies are based on Generalized node capacity is inversely related to the dimensional- Hyperplane. ity of the data sets, we did not investigate the effect The Effect of Dimensionality of changing the node size. We now consider how increasing the dimensionality We tested all the split policies described in Section of the data set influences the performance of M-tree. 4, and evaluated them under a variety of experimen- The number of indexed objects is IO4 in all the graphs. tal settings. To gain the flexibility needed for com- Figure 3 shows that all the split policies but mRAD2 parative analysis, most experiments were baaed on and mMRAD-2 compute almost the same number of dis- synthetic data sets, and here we only report about tances for building the tree, and that this number de- them. Data sets were obtained by using the proce- creases when Dim grows. The explanation is that in- dure described in [JDBB] which generates normally- creasing Dim reduces the node capacity, which has a distributed clusters in a Dim-D vector space. In all beneficial effect on the numbers of distances computed the experiments the number of clusters is 10, the vari- by insertion and split algorithms. The reduction is ance is u2 = 0.1, and clusters’ centers are uniformly particularly evident for mRAD2 and mMRADZ, whose distributed (Figure 2 shows a 2-D sample). Distance CPU split costs grow as the square of node capacity. is evaluated using the L, metric, i.e. L,(O,,O,) = I/O costs, shown in Figure 4, have an inverse trend maxyjy { 1O,b] - O,b] I), which leads to hyper-cubic and grow with space dimensionality. This can again search (and covering) regions. Graphs concerning con- be explained by the reduction of node capacity. The struction costs are obtained by averaging the costs of fastest split policy is RANDOM2 and the slowest one is, building 10 M-trees, and results about performance on not surprisingly, mRADZ2. query processing are averaged over 100 queries. Figure 3: Distance camp. for building M-tree Figure 2: A sample data set used in the experiments Balanced vs. Unbalanced Split Policies We first compare the performance of the Balanced and Generalized Hyperplane implementations of the Partition method (see Section 4.2). Table 1 shows the overhead of a balanced policy with respect to the corresponding unbalanced one to process range queries with side “Imm (Dim = 2,10) on lo4 objects. Sim- Y: ilar results were also obtained for larger dimensionali- 10 15 20 & 30 25 4 45 50 0 5 ties and for all the policies not shown here. In the ta- Figure 4: I/O costs for building M-tree ble, as well as in all other figures, a “confirmed” split 432
  • 8. RANDOH- SAMPLING-1 H-LB-DIST-1 RANDOM-2 mRAD-2 Dim = 2 ovh. volume 4.60 4.38 3.90 4.07 1.69 I/O ovh. did. ovh, 2.27,2.11 1.97,1.76 1.93,1.57 2.09,1.96 1.40,1.30 volume ovh. Dim = 10 1.63 1.31 1.49 2.05 2.40 dist. ovh, I/O ovh. 1.58,1.18 1.38,0.92 1.34,0.91 1.55,1.39 1.69,1.12 Table 1: Balanced vs. unbalanced split policies: CPU, I/O, and volume overheads Figure 5 shows that the “quality” of tree construc- cause of the reduced page capacity, which leads to tion, measured by the average covered volume per larger trees. For the considered range of Dim values, page, depends on split policy complexity, and that the node capacity varies by a factor of 10, which almost criterion of the cheap MLBDIST-I policy is indeed ef- coincides with the ratio of I/O costs at Dim = 50 to fective enough. the costs at Dim = 5. As to distance selectivity, differences emerge only with high values of Dim, and favor mMRAD2 and mRAD-2, which exhibit only a moderate performance degradation. Since these two policies have the same complexity, and because of above results, mRAD2 is discarded in subsequent analyses. Scalability 0’ Another major challenge in the design of M-tree was 0 5 10 15 20 30 36 40 45 so 4: to ensure scalability of performance with respect to the Figure 5: Average covered volume per page size of the indexed data set. This addresses both as- Performance on lo-NN query processing, consider- pects of efficiently building the tree and of performing ing both I/O’s and distance selectivities, is shown in well on similarity queries. Figures 6 and 7, respectively - distance selectivity is Table 2 shows the average number of distance com- the ratio of computed distances to the total number of putations and I/O operations per inserted object, for objects. 2-D data sets whose size varies in the range lo4 t 105. ,,,,,,,, _-- Results refer to the RANDOM2policy, but similar trends were also observed for the other policies. The moder- ate increase of the average number of distance compu- tations depends both on the growing height of the tree and on the higher density of indexed objects within clusters. This is because the number of clusters was kept fixed at 10, regardless of the data set size. Figures 8 and 9 show that both I/O and CPU (dis- tance computations) lo-NN search costs grow loga- Figure 6: I/O’s for processing lo-NN queries rithmically with the number of objects, which demon- 0.35 , , , , , , , , , , strates that M-tree scales well in the data set size, and that the dynamic management algorithms do not de- teriorate the quality of the search. It has to be empha- sized that such a behavior is peculiar to the M-tree, since other known metric trees are intrinsically static. 01 5 10 15 20 a 30 35 40 45 50 ’ 8’ 8 t-1 0 c Figure 7: Distance selectivity for lo-NN queries Some interesting observations can be done about these results. First, policies based on “non-confirmed” o 2 ,,t, dcbi.cl‘(xBIW, * 10 promotion, perform better that “confirmed” policies as to I/O costs, especially on high dimensions where they Figure 8: I/O’s for processing IO-NN queries save up to 25% I/O’s. This can be attributed to the As to the relative behaviors of split policies, fig- better object clustering that such policies can obtain. ures show that “cheap” policies (e.g. RANDOM and I/O costs increase with the dimensionality mainly be- 433
  • 9. n. ofobjects (x104) 1 2 3 4 5 6 7 8 9 10 avg. n. dist. camp. 45.0 49.6 53.6 57.5 61.4 65.0 68.7 72.2 73.6 74.7 avg. n. I/O’s 8.9 9.3 9.4 9.5 9.6 9.6 9.6 9.6 9.7 9.8 Table 2: Average number of distance computations and I/O’s for building the M-tree (RANDOHZ split policy) 0 5 10 16 al 30 36 a 45 L% Figure 9: Distance computations for lo-NN queries Figure 11: Distance computations for building M-tree and R*-tree M_LBDIST-1) are penalized by the high node capac- ity (A4 = 60) which arises when indexing 2-D points. Figures 12 and 13 show the search costs for square Indeed, the higher M is, the more effective “complex” range queries with side D’mm. It can be observed split policies are. This is because the number of alter- that I/O costs for R*-tree are higher than those of all natives for objects’ promotion grows as M2, thus for M-tree variants. In order to present a fair compari- high values of M the probability that cheap policies son of CPU costs, Figure 13 also shows, for each M- perform a good choice considerably decreases. tree split policy, a graph (labelled (non opt)) where the optimization for reducing the number of distance computations (see Lemma 3.2) is not applied. Graphs 5.1 Comparing M-tree and R’-tree show that this optimization is highly effective, saving The final set of experiments we present compares M- up to 40% distance computations (similar results were tree with R*-tree. The R*-tree implementation we use obtained for NN-queries). Note that, even without is the one available with the GiST package. We defi- such an optimization, M-tree is almost always more nitely do not deeply investigate merits and drawbacks efficient than R*-tree. of the two structures, rather we provide some refer- From these results, which we remind are far from ence results obtained from an access method which is providing a detailed comparison of M-trees and R*- well-known and largely used in database systems. Fur- trees, we can anyway see that M-tree is a competi- thermore, although M-tree has an intrinsically wider tive access method even for indexing data from vector applicability range, we consider important to evalu- spaces. ate its relative performance on “traditional” domains where other access methods could be used as well. Results in Figures 10 and 11 compare I/O and CPU costs, respectively, to build R*-tree and M-tree, the latter only for the RANDOHZ, MLBDIST-1, and mMRAD2 policies. The trend of the graphs for R*-tree confirms what already observed about the influence of the node capacity (see Figures 3 and 4). Graphs em- phasize the different perfomance of M-trees and R*- trees in terms of CPU costs, whereas both structures Figure 12: I/O’s for processing range queries have similar I/O building costs. 0 5 101520c 3-J 404550 35 Figure 13: Distance selectivity for range queries Figure 10: I/O costs for building M-tree and W-tree 434
  • 10. Conclusions 6 [B097] T. Bozkaya and M. Ozsoyoglu. Distance- baaed indexing for high-dimensional metric The M-tree is an original index/storage structure with spaces. ACM SIGMOD, pp. 357-368, Tucson, the following major innovative properties: AZ, May 1997. it is a paged, balanced, and dynamic secondary [Bri95] S. Brin. Near neighbor search in large met- l ric spaces. 21st VLDB, pp. 574-584, Zurich, memory structure able to index data sets from Switzerland, Sept. 1995. generic metric spaces; similarity range and nearest neighbor queries can [Chi94] T. Chiueh. Content-based image indexing. l be performed and results ranked with respect to 20th VLDB, pp. 582-593, Santiago, Chile, a given query object; Sept. 1994. query execution is optimized to reduce both the l C. Faloutsos, W. Equitz, M. Flickner, [FEF+ 941 number of page reads and the number of distance W. Niblack, D. Petkovic, and R. Barber. Ef- computations; ficient and effective querying by image con- it is also suitable for high-dimensional vector data. l tent. J. of Intell. In. Sys., 3(3/4):231-262, July 1994. Experimental results show that M-tree achieves its pri- C. Faloutsos and K.-I. Lin. FastMap: A [FL951 mary goal, that is, dynamicity and, consequently, scal- fast algorithm for indexing, data-mining and ability with the size of data sets from generic met- visualization of traditional and multimedia ric spaces. Analysis of many available split policies datasets. ACM SIGMOD, pp. 163-174, San suggests that the proper choice should reflect the rel- Jose, CA, June 1995. ative weights that CPU (distance computation) and [FRM94] C. Faloutsos, M. Ranganathan, and I/O costs may have. This possibility of “tuning” M- Y. Manolopoulos. Fast subsequence match- tree performance with respect to these two cost factors, ing in time-series databases. ACM SIGMOD, which are highly dependent on the specific application pp. 419-429, Minneapolis, MN, May 1994. domain, has never been considered before in the anal- [Gut841 A. Guttman. R-trees: A dynamic index struc- ysis of other metric trees. Our implementation, based ture for spatial searching. ACM SIGMOD, pp. on the GiST package, makes clear that M-tree can ef- 47-57, Boston, MA, June 1984. fectively extend the set of access methods of a database J.M. Hellerstein, J.F. Naughton, and A. Pfef- [HNP95] system.5 fer. Generalized search trees for database sys- Current and planned research work includes: sup- tems. 2fst VLDB, Zurich, Switzerland, Sept. port for more complex similarity queries, use of vari- 1995. able size nodes to perform only “good” splits [BKKSG], A.K. Jain and R.C. Dubes. Algorithms for [JD88] and parallelization of CPU and I/O loads. Real-life Clustering Data. Prentice-Hall, 1988. applications, such as fingerprint identification and pro- N. Roussopoulos, S. Kelley, and F. Vincent. [RKV95] tein matching, are also considered. Nearest neighbor queries. ACM SIGMOD, pp. 71-79, San Jose, CA, May 1995. Acknowledgements T.K. Sellis, N. Roussopoulos, and C. Falout- [SRF87] This work has been funded by the EC ESPRIT LTR Proj. SOS.The Rt-tree: A dynamic index for multi- 9141 HERMES. P. Zezula has also been supported by dimensional objects. 13th VLDB, pp. 507- Grants GACR 102/96/0986 and KONTAKT PM96 S028. 518, Brighton, England, Sept. 1987. [Uhf911 J.K. Uhlmann. Satisfying general proxim- References ity/similarity queries with metric trees. Inf. Proc. Lett., 40(4):175-179, Nov. 1991. R. Agrawal, C. Faloutsos, and A. Swami. Effi- [AFS93] [VM95] M. Vassilakopoulos and Y. Manolopoulos. cient similarity search in sequence databases. Dynamic inverted quadtree: A structure for FODO’93, pp. 69-84, Chicago, IL, Oct. 1993. pictorial databases. In. Sys., 20(6):483-500, Springer LNCS, Vol. 730. Sept. 1995. S. Berchtold, D.A. Keim, and H.-P. Kriegel. [BKK96] [WBKW96] E. Wold, T. Blum, D. Keislar, and An index structure for high- The X-tree: J. Wheaton. Content-based classification, dimensional data. 2&nd VLDB, pp. 28-39, search, and retrieval of audio. IEEE Multi- Mumbai (Bombay), India, Sept. 1996. media, 3(3):27-36, 1996. N. Beckmann, H.-P. Kriegel, R. Schneider, [BKSSSO] P. Zezula, P. Ciaccia, and F. Rabitti. M- [ZCR96] and B. Seeger. The R.-tree: An efficient and tree: A dynamic index for similarity queries in robust access method for points and rectan- multimedia databases. TR 7, HERMES ES- gles. ACM SIGMOD, pp. 322-331, Atlantic PRIT LTR Project, 1996. Available at URL City, NJ, May 1990. http://uwu.ced.tuc.gr/hermes/. 5We are now implementing M-tree in the PostgreSQL system. 435