SlideShare une entreprise Scribd logo
1  sur  15
Télécharger pour lire hors ligne
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
                                                                                                                                                                                                                 1




                  Efficient Computation of Range Aggregates
                  against Uncertain Location Based Queries
                                    Ying Zhang1 Xuemin Lin1,2 Yufei Tao3 Wenjie Zhang1 Haixun Wang4
                                       1
                                     University of New South Wales, {yingz,lxue, zhangw}@cse.unsw.edu.au 2 NICTA
                   3
                       Chinese University of Hong Kong, taoyf@cse.cuhk.edu.hk 4 Microsoft Research Asia, haixunw@microsoft.com


                Abstract—In many applications, including location based services, queries may not be precise. In this paper, we study the problem
                of efficiently computing range aggregates in a multidimensional space when the query location is uncertain. Specifically, for a query
                point Q whose location is uncertain and a set S of points in a multi-dimensional space, we want to calculate the aggregate (e.g., count,
                average and sum) over the subset S of S such that for each p ∈ S , Q has at least probability θ within the distance γ to p. We
                propose novel, efficient techniques to solve the problem following the filtering-and-verification paradigm. In particular, two novel filtering
                techniques are proposed to effectively and efficiently remove data points from verification. Our comprehensive experiments based on
                both real and synthetic data demonstrate the efficiency and scalability of our techniques.

                Index Terms—Uncertainty, Index, Range aggregate query

                                                                                        ✦


         1      I NTRODUCTION                                                               p5 will be destroyed. Similarly, objects p2 , p3 and p6 will
                                                                                            be destroyed if the actual falling point is q2 . In this appli-




                                                                                                    t.c om
                                                                                                       om
         Query imprecision or uncertainty may be often caused                               cation, the risk of civilian casualties may be measured by


                                                                                                  po t.c
         by the nature of many applications, including location                             the total number n of civilian objects which are within γ
                                                                                                gs po
         based services. The existing techniques for processing                             distance away from a possible blast point with at least
                                                                                              lo s
                                                                                            .b og
         location based spatial queries regarding certain query                             θ probability. Note that the probabilistic threshold is set
                                                                                          ts .bl


         points and data points are not applicable or inefficient
                                                                                        ec ts


                                                                                            by the commander based on the levels of trade-off that
                                                                                      oj c




         when uncertain queries are involved. In this paper, we
                                                                                    pr oje




                                                                                            she wants to make between the risk of civilian damages
                                                                                  re r




         investigate the problem of efficiently computing distance                           and the effectiveness of military attacks; for instance, it is
                                                                                lo rep




         based range aggregates over certain data points and                                unlikely to cause civilian casualties if n = 0 with a small
                                                                              xp lo
                                                                            ee xp




         uncertain query points as described in the abstract. In                            θ. Moreover, different weight values may be assigned
                                                                         .ie ee




         general, an uncertain query Q is a multi-dimensional                               to these target points and hence the aggregate can be
                                                                        w e
                                                                       w .i




         point that might appear at any location x following                                conducted based on the sum of the values.
                                                                      w w
                                                                   :// w




         a probabilistic density function pdf (x) within a region
                                                                tp //w




         Q.region. There is a number of applications where a
                                                              ht ttp:




                                                                                                                     p1
                                                                                                                                                        p3
         query point may be uncertain. Below are two sample
                                                               h




                                                                                                            γ             q1
         applications.                                                                                                                                        γ
                                                                                                                                   a             q2
         Motivating Application 1. A blast warhead carried by                                                   p5                          p2          p6
         a missile may destroy things by blast pressure waves in                                                               Q
         its lethal area where the lethal area is typically a circular                                                                                                  p4
                                                                                                                                       p7
         area centered at the point of explosion (blast point) with
                                                                                                 Q : s h a d o w e d re g i o n t o i n d i c a t e t h e p o s s i b l e l o c a t i o n s o f t h e q u e ry
         radius γ [24] and γ depends on the explosive used.                                   q1, q 2 : to i n d i c a te tw o p o s s i b l e l o c a ti o n s o f Q
         While firing such a missile, even the most advanced                                      γ     : q u e ry d i s t a n c e
         laser-guided missile cannot exactly hit the aiming point
         with 100% guarantee. The actual falling point (blast                               Fig. 1. Missile Example
         point) of a missile blast warhead regarding a target
         point usually follows some probability density functions                           Motivating Application 2. Similarly, we can also esti-
         (P DF s); different P DF s have been studied in [24] where                         mate the effectiveness of a police vehicle patrol route
         bivariate normal distribution is the simplest and the most                         using range aggregate against uncertain location based
         common one [24]. In military applications, firing such                              query Q. For example, Q in Fig. 1 now corresponds
         a missile may not only destroy military targets but may                            to the possible locations of a police patrol vehicle in a
         also damage civilian objects. Therefore, it is important to                        patrol route. A spot (e.g., restaurant, hotel, residential
         avoid the civilian casualties by estimating the likelihood                         property), represented by a point in {p1 , p2 , . . . , p7 } in
         of damaging civilian objects once the aiming point of a                            Fig. 1, is likely under reliable police patrol coverage [11]
         blast missile is determined. As depicted in Fig. 1, points                         if it has at least θ probability within γ distance to a
         {pi } for 1 ≤ i ≤ 7 represent some civilian objects (e.g.,                         moving patrol vehicle, where γ and θ are set by domain
         residential buildings, public facilities ). If q1 in Fig. 1 is                     experts. The number of spots under reliable police patrol
         the actual falling point of the missile, then objects p1 and                       coverage is often deployed to evaluate the effectiveness



Digital Object Indentifier 10.1109/TKDE.2011.46                        1041-4347/11/$26.00 © 2011 IEEE
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
                                                                                                                                                                        2


       of the police patrol route.                                                             tion to be stored. Both of them can be applied to
         Motivated by the above applications, in the paper we                                  continuous case and discrete case.
       study the problem of aggregate computation against the                               • Extensive experiments are conducted to demon-
       data points which have at least probability θ to be within                              strate the efficiency of our techniques.
       distance γ regarding an uncertain location based query.                              • While we focus on the problem of range counting for
                                                                                               uncertain location based queries in the paper, our
       Challenges. A naive way to solve this problem is that                                   techniques can be immediately extended to other
       for each data point p ∈ S, we calculate the probability,                                range aggregates.
       namely falling probability, of Q within γ distance to p,
                                                                                            The remainder of the paper is organized as follows.
       select p against a given probability threshold, and then
                                                                                          Section 2 formally defines the problem and presents
       conduct the aggregate. This involves the calculation of
                                                                                          preliminaries. In Section 3, following the filtering-and-
       an integral regarding each p and Q.pdf for each p ∈ S;
                                                                                          verification framework, we propose three filtering tech-
       unless Q.pdf has a very simple distribution (e.g., uniform
                                                                                          niques. Section 4 evaluates the proposed techniques with
       distributions), such a calculation may often be very ex-
                                                                                          extensive experiments. Then some possible extensions of
       pensive and the naive method may be computationally
                                                                                          our techniques are discussed in Section 5. This is fol-
       prohibitive when a large number of data points is in-
                                                                                          lowed by related work in Section 6. Section 7 concludes
       volved. In the paper we target the problem of efficiently
                                                                                          the paper.
       computing range aggregates against an uncertain Q for
       arbitrary Q.pdf and Q.region. Note that when Q.pdf is
       a uniform distribution within a circular region Q.region,                          2     BACKGROUND I NFORMATION
       a circular “window” can be immediately obtained ac-                                We first formally define the problem in Section 2.1, then
       cording to γ and Q.region so that the computation of                               Section 2.2 presents the PCR technique [26] which is em-
       range aggregates can be conducted via the window                                   ployed in the filtering technique proposed in Section 3.3.
       aggregates [27] over S.




                                                                                                   t.c om
                                                                                                  Notation             Definition




                                                                                                      om
                                                                                                 po t.c
                                                                                                        Q              uncertain location based query
       Contributions. Our techniques are developed based on
                                                                                               gs po
                                                                                                        S              a set of points
       the standard filtering-and-verification paradigm. We first                               lo s       q              instance of an uncertain query Q
                                                                                           .b og
       discuss how to apply the existing probabilistically con-                                         d              dimensionality
                                                                                         ts .bl

                                                                                                       Pq              the probability of the q to appear
       strained regions (PCR) technique [26] to our problem.
                                                                                       ec ts



                                                                                                   θ and γ             probabilistic threshold and query distance
                                                                                     oj c




       Then, we propose two novel distance based filtering
                                                                                   pr oje




                                                                                                Pf all (Q, p, γ)       the falling probability of p regarding
       techniques, statistical filtering (STF) and anchor point
                                                                                 re r




                                                                                                                       Q and γ
                                                                               lo rep




       filtering (APF), respectively, to address the inherent lim-                                 Qθ,γ (S)             {p|p ∈ S ∧ Pf all (Q, p, γ) ≥ θ}
                                                                             xp lo




                                                                                                p, x, y, b(S)          point (a set of data points)
                                                                           ee xp




       its of the PCR technique. The basic idea of the STF                                            e                R tree entry
                                                                        .ie ee




       technique is to bound the falling probability of the points                                  Cp,r               a circle(sphere) centred at p with radius r
                                                                       w e
                                                                      w .i




       by applying some well known statistical inequalities                                        δ(x, y)             the distance between x and y
                                                                     w w




                                                                                              δmax(min) (r1 , r2 )
                                                                  :// w




       where only a small amount of statistic information about                                                        the maximal(minimal) distance
                                                               tp //w




                                                                                                                       between two rectangular regions
       the uncertain location based query Q is required. The
                                                             ht ttp:




                                                                                                      gQ               mean of Q
       STF technique is simple and space efficient (only d + 2                                         ηQ               weighted average distance of Q
                                                              h




       float numbers required where d denotes the dimension-                                           σQ               variance of Q
                                                                                                                       arbitrarily small positive constant value
       ality), and experiments show that it is effective. For the                                      a               anchor point
       scenarios where a considerable “large” space is available,                                    nap               the number of anchor points
       we propose a view based filter which consists of a set of                                  LPf all (p, γ)        lower bound of the Pf all (p, γ)
       anchor points. An anchor point may reside at any location                                 U Pf all (p, γ)       upper bound of the Pf all (p, γ)
                                                                                                      nd               the number of different distances
       and its falling probability regarding Q is pre-computed                                                         pre-computed for each anchor point
       for several γ values. Then many data points might be                                           Da               a set of distance values used by
       effectively filtered based on their distances to the anchor                                                      anchor point a
       points. For a given space budget, we investigate how to                                                            TABLE 1
       construct the anchor points and their accessing orders.                                                     The summary of notations.
          To the best of our knowledge, we are the first to
       identify the problem of computing range aggregates
       against uncertain location based query. In this paper, we                          2.1 Problem Definition
       investigate the problem regarding both continuous and                              In the paper, S is a set of points in a d-dimensional
       discrete Q. Our principle contributions can be summa-                              numerical space. The distance between two points x and
       rized as follows.                                                                  y is denoted by δ(x, y). Note that techniques developed
                                                                                          in the paper can be applied to any distance metrics [5].
          •   We propose two novel filtering techniques, STF and
                                                                                          In the examples and experiments, the Euclidean distance
              APF, respectively. The STF technique has a decent
                                                                                          is used. For two rectangular regions r1 and r2 , we have
              filtering power and only requires the storage of very
                                                                                          δmax (r1 , r2 ) = max∀x∈r1 ,y∈r2 δ(x, y) and
              limited pre-computed information. APF provides
              the flexibility to significantly enhance the filtering                                                   0                      if r1 ∩ r2 = ∅
                                                                                          δmin (r1 , r2 ) =                                                            (1)
              power by demanding more pre-computed informa-                                                         min∀x∈r1 ,y∈r2 δ(x, y) otherwise
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
                                                                                                                                                                        3


         An uncertain (location based) query Q may be de-                                 max, sum, avg, etc., over some non-locational attributes
       scribed by a continuous or a discrete distribution as fol-                         (e.g., weight value of the object in missile example).
       lows.                                                                              Example 1. In Fig. 2, S = {p1 , p2 , p3 } and Q = {q1 , q2 , q3 }
       Definition 1 (Continuous Distribution). An uncertain                                where Pq1 = 0.4, Pq2 = 0.3 and Pq3 = 0.3. According
       query Q is described by a probabilistic density function Q.pdf .                   to Definition 3, for the given γ, we have Pf all (p1 , γ) =
       Let Q.region represent the region where Q might appear, then                       0.4, Pf all (p2 , γ) = 1, and Pf all (p3 , γ) = 0.6. Therefore,
        x∈Q.region
                   Q.pdf (x)dx = 1;                                                       Qθ,γ (S) = {p2 , p3 } if θ is set to 0.5, and hence |Qθ,γ (S)| = 2.
       Definition 2 (Discrete Distribution). An uncertain query
       Q consists of a set of instances (points) {q1 , q2 , . . . , qn } in a d-          2.2     Probabilistically Constrained Regions (PCR)
       dimensional numerical space where qi appears with probability
       Pqi , and q∈Q Pq = 1;                                                              In [26], Tao et al. study the problem of range query on
                                                                                          uncertain objects, in which the query is a rectangular
         Note that, in Section 5 we also cover the applications
                                                                                          window and the location of each object is uncertain.
       where Q can have a non-zero probability to be absent;
                                                                                          Although the problem studied in [26] is different with
       that is, x∈Q.region Q.pdf (x)dx = c or q∈Q Pq = c for a
                                                                                          the one in this paper, in Section 3.3 we show how
       c < 1.
                                                                                          to modify the techniques developed in [26] to support
         For a point p, we use Pf all (Q, p, γ) to represent the
                                                                                          uncertain location based query.
       probability of Q within γ distance to p, called falling
                                                                                             In the following part, we briefly introduce the Prob-
       probability of p regarding Q and γ. It is formally defined
                                                                                          abilistically Constrained Region (PCR) technique devel-
       below.
                                                                                          oped in [26]. Same as the uncertain location based query,
         For continuous cases,
                                                                                          an uncertain object U is modeled by a probability density
                                                                                          function U.pdf (x) and an uncertain region U.region.
          Pf all (Q, p, γ) =                                   Q.pdf (x)dx        (2)
                                   x∈Q.region ∧ δ(x,p)≤γ                                  The probability that the uncertain object U falls in the
                                                                                          rectangular window query rq , denoted by Pf all (U, rq ),




                                                                                                     t.c om
                                                                                                        om
          For discrete cases,                                                             is defined as x∈U.region∩rq U.pdf (x)dx. In [26], the prob-


                                                                                                   po t.c
                                                                                                 gs po
                     Pf all (Q, p, γ) =                        Pq                 (3)     abilistically constrained region of the uncertain object
                                                                                               lo s
                                                                                             .b og
                                             q∈Q ∧ δ(q,p)≤γ
                                                                                          U regarding probability θ (0 ≤ θ ≤ 0.5), denoted by
                                                                                           ts .bl

                                                                                          U.pcr(θ), is employed in the filtering technique. Partic-
                                                                                         ec ts



          In the paper hereafter, Pf all (Q, p, γ) is abbreviated to
                                                                                       oj c




                                                                                          ularly, U.pcr(θ) is a rectangular region constructed as
                                                                                     pr oje




       Pf all (p, γ), and Q.region and Q.pdf are abbreviated to Q                         follows.
                                                                                   re r
                                                                                 lo rep




       and pdf respectively, whenever there is no ambiguity. It                              For each dimension i, the projection of U.pcr(θ)
                                                                               xp lo




       is immediate that Pf all (p, γ) is a monotonically increas-                                                     [U.pcri− (θ), U.pcri+ (θ)]
                                                                             ee xp




                                                                                          is      denoted       by                                   where
                                                                          .ie ee




       ing function with respect to distance γ.
                                                                                           x∈U.region&xi ≤U.pcri− (θ)  U.pdf (x)dx        =       θ     and
                                                                         w e
                                                                        w .i




                                                                                           x∈U.region&xi ≥U.pcri+ (θ) U.pdf (x)dx      = θ. Note that
                                                                       w w




                                    γ
                                                                    :// w




                                                           γ                              xi represents the coordinate value of the point x
                                                                 tp //w




                              p1        q1                                                on i-th dimension. Then U.pcr(θ) corresponds to
                                                               ht ttp:




                                              Q       q2 p3                               a rectangular region [U.pcr− (θ), U.pcr+ (θ)] where
                                                                h




                                               p2    q3                                   U.pcr− (θ) (U.pcr+ (θ)) is the lower (upper) corner and
                                                                                          the coordinate value of U.pcr− (θ) (U.pcr+ (θ)) on i-th
                                               γ
                                                                                          dimension is U.pcri− (θ) (U.pcri+ (θ)). Fig. 3(a) illustrates
                                                                                          the U.pcr(0.2) of the uncertain object U in 2 dimensional
                                                                                          space. Therefore, the probability mass of U on the left
       Fig. 2. Example of Pf all (Q, p, γ)
                                                                                          (right) side of l1− (l1+ ) is 0.2 and the probability mass of
       Problem Statement.                                                                 U below (above) the l2− (l2+ ) is 0.2 as well. Following
       In many applications, users are only interested in the                             is a motivating example of how to derive the lower and
       points with falling probabilities exceeding a given prob-                          upper bounds of the falling probability based on PCR.
       abilistic threshold regarding Q and γ. In this paper we                            Example 2. According to the definition of PCR, in Fig. 3(b)
       investigate the problem of probabilistic threshold based                           the probabilistic mass of U in the shaded area is 0.2, i.e.,
       uncertain location range aggregate query on points data;                            x∈U.region&x1 ≥U.pcr1+ (θ)
                                                                                                                        U.pdf (x)dx = 0.2. Then, it is im-
       it is formally defined below.                                                       mediate that Pf all (U, rq1 ) < 0.2 because rq1 does not intersect
       Definition 3. [Uncertain Range Aggregate Query] Given                               U.pcr(0.2). Similarly, we have Pf all (U, rq2 ) ≥ 0.2 because the
       a set S of points, an uncertain query Q, a query distance                          shaded area is enclosed by rq2 .
       γ and a probabilistic threshold θ, we want to compute an                              The following theorem [26] formally introduces how
       aggregate function (e.g., count, avg, and sum) against points                      to prune or validate an uncertain object U based on
       p ∈ Qθ,γ (S), where Qθ,γ (S) denotes a subset of points                            U.pcr(θ) or U.pcr(1 − θ). Note that we say an uncertain
       {p} ⊆ S such that Pf all (p, γ) ≥ θ.                                               object is pruned (validated) if we can claim Pf all (U, rq ) < θ
          In this paper, our techniques will be presented based                           (Pf all (U, rq ) ≥ θ) based on the P CR.
       on the aggregate count. Nevertheless, they can be imme-                            Theorem 1. Given an uncertain object U , a range query rq
       diately extended to cover other aggregates, such as min,                           (rq is a rectangular window) and a probabilistic threshold θ.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
                                                                                                                                                                        4


                                                                                          shows that techniques proposed in this section can be
                          U region                                                        immediately applied to the discrete case.

                                                            rq 2
                l 2+                                                                      3.1 A framework for filtering-and-verification Algo-
                                                                                          rithm

                l 2-                                                                      In this subsection, following the filtering-and-verification
                                                                                          paradigm we present a general framework to support
                                            l 1+                                          uncertain range aggregate query based on the filtering
                       l 1-
                                                                              rq1         technique. To facilitate the aggregate query computation,
                              U .pcr(0.2)          U .pcr(0.2)        l 1+                we assume a set S of points is organized by an aggregate
                               (a) PCR               (b) PCR based filtering
                                                                                          R-Tree [22], denoted by RS . Note that an entry e of
                                                                                          RS might be a data entry or an intermediate entry
       Fig. 3. A 2d probabilistically constrained region (PCR (0.2))                      where a data entry corresponds to a point in S and an
         1) For θ > 0.5, U can be pruned if rq does not fully contain                     intermediate entry groups a set of data entries or child
             U.pcr(1 − θ);                                                                intermediate entries. Assume a filter, denoted by F , is
         2) For θ ≤ 0.5, the pruning condition is that rq does not                        available to prune or validate a data entry (i.e., a point)
             intersect U.pcr(θ);                                                          or an intermediate entry (i.e., a set of points).
         3) For θ > 0.5, the validating criterion is that rq com-                            Algorithm 1 illustrates the framework of the filtering-
             pletely contains the part of Umbb on the right (left) of                     and-verification Algorithm. Note that details of the fil-
             plane U.pcri− (1−θ) (U.pcri+ (1−θ)) for some i ∈ [1, d],                     tering techniques will be introduced in the following
             where Umbb is the minimal bounding box of uncertain                          subsections. The algorithm consists of two phases. In the
             region U.region;                                                             filtering phase (Line 3-16), for each entry e of RS to be




                                                                                                         t.c om
         4) For θ ≤ 0.5 the validating criterion is that rq completely                    processed, we do not need to further process e if it is




                                                                                                            om
                                                                                                       po t.c
             contains the part of Umbb on the left (right) of plane                       pruned or validated by the filter F . We say an entry e is
                                                                                                     gs po
             U.pcri− (θ) (U.pcri+ (θ)) for some i ∈ [1, d];                               pruned (validated) if the filter can claim Pf all (p, γ) < θ
                                                                                                   lo s
                                                                                                 .b og
                                                                                          (Pf all (p, γ) ≥ θ) for any point p within embb . The counter
                                                                                               ts .bl


       3      Filtering-and-Verification A LGORITHM                                        cn is increased by |e| (Line 6) if e is validated where
                                                                                             ec ts
                                                                                           oj c




                                                                                          |e| denotes the aggregate value of e (i.e., the number
                                                                                         pr oje




       According to the definition of falling probability (i.e.,
                                                                                          of data points in e). Otherwise, the point p associated
                                                                                       re r




       Pf all (p, γ)) in Equation 2, the computation involves in-
                                                                                     lo rep




                                                                                          with e is a candidate point if e corresponds to a data
                                                                                   xp lo




       tegral calculation, which may be expensive in terms of
                                                                                          entry (Line 10), and all child entries of e are put into the
                                                                                 ee xp




       CPU cost. Based on Definition 3, we only need to know
                                                                              .ie ee




                                                                                          queue for further processing if e is an intermediate entry
       whether or not the falling probability of a particular point
                                                                             w e




                                                                                          (Line 12). The filtering phase terminates when the queue
                                                                            w .i
                                                                           w w




       regarding Q and γ exceeds the probabilistic threshold
                                                                        :// w




                                                                                          is empty. In the verification phase (Line 17-21), candidate
                                                                     tp //w




       for the uncertain aggregate range query. This motivates
                                                                                          points are verified by the integral calculations according
                                                                   ht ttp:




       us to follow the filtering-and-verification paradigm for the
                                                                                          to Equation 2.
                                                                    h




       uncertain aggregate query computation. Particularly, in
       the filtering phase, effective and efficient filtering tech-                          Cost Analysis. The total time cost of Algorithm 1 is as
       niques will be applied to prune or validate the points. We                         follows.
       say a point p is pruned (validated) regarding the uncertain                                Cost =          Nf × Cf + Nio × Cio + Nca × Cvf                    (4)
       query Q, distance γ and probabilistic threshold θ if we
       can claim that Pf all (p, γ) < θ ( Pf all (p, γ) ≥ θ ) based on                    Particularly, Nf represents the number of entries being
       the filtering techniques without explicitly computing the                           tested by the filter on Line 5 and Cf is the time cost
       Pf all (p, γ). The points that cannot be pruned or validated                       for each test. Nio denotes the number of nodes (pages)
       will be verified in the verification phase in which their                            accessed (Line 13) and Cio corresponds to the delay of
       falling probabilities are calculated. Therefore, it is desirable                   each node (page) access of RS . Nca represents the size
       to develop effective and efficient filtering techniques to                           of candidate set C and Cvf is the computation cost for
       prune or validate points such that the number of points                            each verification (Line 15) in which numerical integral
       being verified can be significantly reduced.                                         computation is required. With a reasonable filtering time
          In this section, we first present a general framework                            cost (i.e., Cvf ), the dominant cost of Algorithm 1 is
       for the filtering-and-verification Algorithm based on fil-                            determined by Nio and Nca because Cio and Cvf might
       tering techniques in Section 3.1. Then a set of filtering                           be expensive. Therefore, in the paper we aim to develop
       techniques are proposed. Particularly, Section 3.2 pro-                            effective and efficient filtering techniques to reduce Nca
       poses the statistical filtering technique. Then we investi-                         and Nio .
       gate how to apply the PCR based filtering technique in                              Filtering. Suppose there is no filter F in Algorithm 1,
       Section 3.3. Section 3.4 presents the anchor point based                           all points in S will be verified. Regarding the example
       filtering technique.                                                                in Fig. 4, 5 points p1 , p2 , p3 , p4 and p5 will be veri-
          For presentation simplicity, we consider the continuous                         fied. A straitforward filtering technique is based on the
       case of the uncertain query in this section. Section 3.5                           minimal and maximal distances between the minimal
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
                                                                                                                                                                        5



       Algorithm 1 Filtering-and-Verification(RS , Q, F , γ, θ)                                          in Algorithm 1 is significantly increased. In the following
                                                                                                        subsections, we present three filtering techniques, named
       Input: RS : an aggregate R tree on data set S,
                                                                                                        STF, PCR and APF respectively, which can significantly
            Q : uncertain query, F : Filter, γ : query distance,
                                                                                                        enhance the filtering capability of the filter.
            θ : probabilistic threshold.
       Output: |Qθ,γ (S)|                                                                               3.2 Statistical Filter
       Description:
         1: Queue := ∅; cn := 0; C := ∅;                                                                In this subsection, we propose a statistical filtering tech-
         2: Insert root of RS into Queue;                                                               nique, namely STF. After introducing the motivation
         3: while Queue = ∅ do                                                                          of the technique, we present some important statistic
         4:    e ← dequeue from the Queue;                                                              information of the uncertain query and then show how
         5:    if e is validated by the filter F then                                                    to derive the lower and upper bounds of the falling
         6:       cn := cn + |e|;                                                                       probability of a point regarding an uncertain query Q,
         7:    else                                                                                     distance γ and probabilistic threshold θ.
         8:       if e is not pruned by the filter F then                                                Motivation. As shown in Fig. 5, given an uncertain
         9:          if e is data entry then                                                            query Q1 and γ we cannot prune point p based on the
        10:             C := C ∪ p where p is the data point e                                          MMD technique, regardless of the value of θ, although
                        represented;                                                                    intuitively the falling probability of p regarding Q1 is
        11:          else                                                                               likely to be small. Similarly, we cannot validate p for Q2 .
        12:             put all child entries of e into Queue;                                          This motivates us to develop a new filtering technique
        13:          end if                                                                             which is as simple as MMD, but can exploit θ to en-
        14:       end if                                                                                hance the filtering capability. In the following part, we
        15:    end if                                                                                   show that lower and upper bounds of Pf all (p, γ) can be
        16: end while                                                                                   derived based on some statistics of the uncertain query.




                                                                                                                        t.c om
                                                                                                                           om
        17: for each point p ∈ C do                                                                     Then a point may be immediately pruned (validated)


                                                                                                                      po t.c
        18:    if Pf all (Q, p, γ) ≥ θ then                                                             based on the upper(lower) bound of Pf all (p, γ), denoted
                                                                                                                    gs po
        19:       cn := cn + 1;                                                                         by U Pf all (p, γ) (LPf all (p, γ)).
                                                                                                                  lo s
                                                                                                                .b og
        20:    end if                                                                                   Example 4. In Fig. 5 suppose θ = 0.5 and we have
                                                                                                              ts .bl


                                                                                                        U Pf all (Q1 , p, γ) = 0.4 (LPf all (Q2 , p, γ) = 0.6) based on
                                                                                                            ec ts


        21: end for
                                                                                                          oj c
                                                                                                        pr oje




        22: Return cn                                                                                   the statistical bounds, then p can be safely pruned (vali-
                                                                                                      re r




                                                                                                        dated) without explicitly computing its falling probability
                                                                                                    lo rep




       bounding boxes(MBBs) of an entry and the uncertain
                                                                                                        regarding Q1 (Q2 ). Regarding the running example in Fig.4,
                                                                                                  xp lo




       query. Clearly, for any θ we can safely prune an entry if
                                                                                                ee xp




                                                                                                        suppose θ = 0.2 and we have U Pf all (p2 , γ) = 0.15 while
       δmin (Qmbb , embb ) > γ or validate it if δmax (Qmbb , embb ) ≤
                                                                                             .ie ee




                                                                                                        U Pf all (pi , γ) ≥ 0.2 for 3 ≤ i ≤ 5, then p2 is pruned. There-
                                                                                            w e




       γ. We refer this as maximal/minimal distance based filter-
                                                                                           w .i




                                                                                                        fore, three points (p3 , p4 and p5 ) are verified in Algorithm 1
                                                                                          w w




       ing technique, namely MMD. MMD technique is time ef-
                                                                                       :// w




                                                                                                        when MMD and statistical filtering techniques are applied.
                                                                                    tp //w




       ficient as it takes only O(d) time to compute the minimal
                                                                                  ht ttp:




       and maximal distances between Qmbb and embb . Recall                                                                 Q1
                                                                                                                                       γ
                                                                                   h




       that Qmbb is the minimal bounding box of Q.region.                                                                                    Q2
                                                                                                                             g Q1
                                                                                                                                        p    g Q2
                                          p1
                                                        p5
                                                                             p4
                                                                                                        Fig. 5. Motivation Example
                                     p2
                                                                                                        Statistics of the uncertain query
                                                                                p3
                                                                                                          To apply the statistical filtering technique, the follow-
                                                                                                        ing statistics of the uncertain query Q are pre-computed.
                            Q : u n c er t ai n l oc at i on b as ed r an g e q u er y                  Definition 4 (mean (gQ )). gQ = x∈Q x × Q.pdf (x)dx.
       Fig. 4. Running Example                                                                          Definition 5 (weighted average distance (ηQ )). ηQ equals
       Example 3. As shown in Fig. 4, suppose the MMD filtering                                           x∈Q δ(x, gQ ) × Q.pdf (x)dx
       technique is applied in Algorithm 1, then p1 is pruned and                                       Definition 6 (variance (σQ )). σQ equals x∈Q δ(x, gQ )2 ×
       the other 4 points p2 , p3 , p4 and p5 will be verified.                                          Q.pdf (x)dx
          Although the MMD technique is very time efficient, its                                         Derive lower and upper bounds of Pf all (p, γ).
       filtering capacity is limited because it does not make use                                          For a point p ∈ S, the following theorem shows how
       of the distribution information of the uncertain query Q                                         to derive the lower and upper bounds of Pf all (p, γ)
       and the probabilistic threshold θ. This motivates us to de-                                      based on above statistics of Q. Then, without explicitly
       velop more effective filtering techniques based on some                                           computing Pf all (p, γ), we may prune or validate the point
       pre-computations on the uncertain query Q such that the                                          p based on U Pf all (p, γ) and LPf all (p, γ) derived based on
       number of entries (i.e., points ) being pruned or validated                                      the statistics of Q.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
                                                                                                                                                                        6


       Theorem 2. Given an uncertain query Q and a distance γ,                            σ = V ar(Y ), according to Lemma 1 when γ > μ we
       and suppose the mean gQ , weighted average distance ηQ and                         have
       variance σQ of Q are available. Then for a point p, we have                                                           1
                                                1                                              P r(Y ≤ γ) ≤ P (Y ≥ γ ) ≤                (6)
         1) If γ > μ1 , Pf all (p, γ) ≥ 1 −   (γ−μ1 )2
                                                       , where μ1 =                                                         (γ −μ )2
                                                                                                                         1+ σ2
                                                       1+      2
                                                              σ1
                                 2         2
             δ(gQ , p) + ηQ and σ1 = σQ − ηQ + 4ηQ × δ(gQ , p).                             Because values of μ, σ 2 , μ and σ 2 may change re-
                                                          1
          2) If γ < δ(gQ , p) − ηQ − , Pf all (p, γ) ≤  (γ −μ2 )2
                                                                  ,                       garding different point p ∈ S, we cannot pre-compute
                                                                        1+      2
                                                                               σ2         them. Nevertheless, in the following part we show that
               where   μ2
                        2 =Δ+              2
                                          = σQ −
                                     ηQ , σ2                 2
                                                      + 4ηQ × Δ, Δ =
                                                            ηQ                            their upper bounds can be derived based on the statistic
               γ + γ + − δ(p, gQ ) and γ > 0. The represents an                           information of the Q, which can be pre-computed based
               arbitrarily small positive constant value.                                 on the probabilistic distribution of Q.
          Before the proof of Theorem 2, we first introduce the                                                                          p
       Cantelli’s Inequality [19] described by Lemma 1 which is
                                                                                                                            Q                 γ
       one-sided version of the Chebyshev Inequality.
       Lemma 1. Let X be an univariate random variable with the
       expected value μ and the finite variance σ 2 . Then for any
                                           1
       C > 0, P r(X − μ ≥ C × σ) ≤ 1+C 2 .                                                                                       gQ
                                                                                                                                              ε
          Following is the proof of Theorem 2.
            Proof: Intuition of the Proof. For a given point p ∈ S,                                                p'                    γ'
       its distance to Q can be regarded as an univariate random
       variable Y , and we have Pf all (p, γ) = P r(Y ≤ γ). Given                         Fig. 6. Proof of Upper bound
       γ, we can derive the lower and upper bounds of P r(Y ≤                               Based on the triangle inequality, for any x ∈ Q we
       γ) (Pf all (p, γ)) based on the statistical inequality in                          have δ(x, p) ≤ δ(x, gQ )+δ(p, gQ ) and δ(x, p) ≥ | δ(x, gQ )−




                                                                                                   t.c om
                                                                                                      om
       Lemma 1 if the expectation (E(Y )) and variance(V ar(Y ))                          δ(p, gQ ) | for any x ∈ Q. Then we have


                                                                                                 po t.c
       of the random variable Y are available. Although E(Y )
       and V ar(Y ) take different values regarding different                               μ =gs po          y × Y.pdf (y)dy =               δ(x, p) × pdf (x)dx
                                                                                             lo s
                                                                                           .b og
       points, we show that the upper bounds of E(Y ) and                                               y∈Y                             x∈Q
                                                                                         ts .bl
                                                                                       ec ts



       V ar(Y ) can be derived based on mean(gQ ), weighted                                      ≤            (δ(p, gQ ) + δ(x, gQ )) × pdf (x)dx
                                                                                     oj c
                                                                                   pr oje




       average distance (ηQ ) and variance(σQ ) of the query Q.                                         x∈Q
                                                                                 re r




                                                                                                 ≤ δ(gQ , p) + ηQ = μ1
                                                                               lo rep




       Then, the correctness of the theorem follows.
                                                                             xp lo




       Details of the Proof. The uncertain query Q is a ran-
                                                                           ee xp




                                                                                          and
       dom variable which equals x ∈ Q.region with prob-
                                                                        .ie ee




       ability Q.pdf (x). For a given point p, let Y denote                                       σ2    =     E(Y 2 ) − E 2 (Y )
                                                                       w e
                                                                      w .i
                                                                     w w




       the distance distribution between p and Q; that is,
                                                                  :// w




       Y is an univariate random variable and Y.pdf (l) =                                               ≤               (δ(gQ , p) + δ(x, gQ ))2 pdf (x)dx
                                                               tp //w




                                                                                                                  x∈Q
                                                             ht ttp:




         x∈Q.region and δ(x,p)=l Q.pdf (x)dx for any l ≥ 0. Conse-
                                                                                                              −(δ(gQ , p) − ηQ )2
                                                              h




       quently, we have Pf all (p, γ) = P r(Y ≤ γ) according to
       Equation 2. Let μ = E(Y ), σ 2 = V ar(Y ) and C = γ−μ ,  σ                                       =     2           δ(gQ , p) × δ(x, gQ ) × pdf (x)dx
       then based on lemma 1, if γ > μ we have                                                                     x∈Q

                                                                         1                                    +           δ(x, gQ )2 × pdf (x)dx
           P r(Y ≥ γ) =            P r(Y − μ ≥ C × σ) ≤
                                                                   1+   ( γ−μ )2
                                                                           σ
                                                                                                                   x∈Q
                                                                                                                                     2
                                                                                                              +2 × δ(gQ , p) × ηQ − ηQ
                                                                                                                    2                      2
       Then it is immediate that                                                                        =     σQ − ηQ + 4ηQ × δ(gQ , p) = σ1
                                                                   1                       Together with Inequality 5, we have P r(Y ≤ γ) ≥
           P r(Y ≤ γ) ≥ 1 − P r(Y ≥ γ) ≥ 1 −                       (γ−μ)2
                                                                                    (5)
                                                             1+                                 1                1
                                                                     σ2                   1−   (γ−μ)2
                                                                                                      ≥ 1−     (γ−μ1 )2
                                                                                                                        if μ1 < γ. With similar
                                                                                                1+     σ2
                                                                                                                          1+     2
                                                                                                                                σ1
          According to Inequation 5 we can derive the lower                               rationale, let Δ = δ(gQ , p ) = γ + γ + − δ(p, gQ ) we
       bound of Pf all (p, γ). Next, we show how to derive                                have μ ≥ Δ + ηQ = μ2 and σ 2 ≤ σQ − ηQ + 4ηQ ×    2
       upper bound of Pf all (p, γ). As illustrated in Fig. 6,                                    2
                                                                                          Δ = σ2 . Based on Inequality 6, we have P r(Y ≤ γ) ≤
       let p denote a dummy point on the line pgQ with                                         1            1
                                                                                             (γ −μ )2
                                                                                                      ≤   (γ −μ2 )2
                                                                                                                    if γ < δ(gQ , p)− ηQ − . Therefore,
       δ(p , p) = γ + γ + where          is an arbitrarily small                          1+     σ 2
                                                                                                            1+      2
                                                                                                                   σ2

       positive constant value. Similar to the definition of                               the correctness of the theorem follows.
       Y , let Y be the distance distribution between p and                                 The following extension is immediate based on the
       Q; that is, Y is an univariate random variable where                               similar rationale of Theorem 2.
       Y .pdf (l) = x∈Q.region and δ(x,p )=l Q.pdf (x)dx for any                          Extension 1. Suppose r is a rectangular region, we can
       l ≥ 0. Then, as shown in Fig. 6, for any point x ∈ Q                               use δmin (r, gQ ) and δmax (r, gQ ) to replace δ(p, gQ ) in
       with δ(x, p ) ≤ γ (shaded area), we have δ(x, p) > γ. This                         Theorem 2 for lower and upper probabilistic bounds
       implies that P (Y ≤ γ) ≤ P (Y ≥ γ ). Let μ = E(Y ) and                             computation respectively.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
                                                                                                                                                                           7


          Based on Extension 1, we can compute the upper                                                     Q.pcr(0.2) according to Theorem 1 in Section 2.2. Conse-
       and lower bounds of Pf all (embb , γ) where embb is the                                               quently, only p3 and p5 go to the verification phase when
       minimal bounding box of the entry e, and hence prune                                                  Q.pcr(0.2) is available.
       or validate e in Algorithm 1. Since gQ , ηQ and σQ are                                                   Same as [26], [28], a finite number of P CRs are pre-
       pre-computed, the dominant cost in filtering phase is                                                  computed for the uncertain query Q regarding different
       the distance computation between embb and gQ which                                                    probability values. For a given θ at query time, if the
       is O(d).                                                                                              Q.pcr(θ) is not pre-computed we can choose two pre-
                                                                                                             computed P CRs Q.pcr(θ1 ) and Q.pcr(θ2 ) where θ1 (θ2 )
                                                                                                             is the largest (smallest) existing probability value smaller
       3.3 PCR based Filter                                                                                  (larger) than θ. We can apply the modified PCR tech-
       Motivation. Although the statistical filtering technique                                               nique as the filter in Algorithm 1, and the filtering time
       can significantly reduce the candidate size in Algo-                                                   regarding each entry tested is O(m+log(m)) in the worst
       rithm 1, the filtering capacity is inherently limited be-                                              case , where m is the number of P CRs pre-computed by
       cause only a small amount of statistics are employed.                                                 the filter.
       This motivates us to develop more sophisticated filtering                                                 The PCR technique can significantly enhance the filter-
       techniques to further improve the filtering capacity; that                                             ing capacity when a particular number of PCR s are pre-
       is, we aim to improve the filtering capacity with more                                                 computed. The key of the PCR filtering technique is to
       pre-computations (i.e., more information kept for the                                                 partition the uncertain query along each dimension. This
       filter). In this subsection, the PCR technique proposed                                                may inherently limit the filtering capacity of the PCR
       in [26] will be modified for this purpose.                                                             based filtering technique. As shown in Fig. 7, we have
                                                                                                             to use two rectangular regions for pruning and validation
                                                    R+ , p                   C p,                            purpose, and hence the Cp,γ is enlarged (shrunk) during
                               Q region
                                                                                                             the computation. As illustrated in Fig. 7, all instances of




                                                                                                                             t.c om
                                                                                                             Q in the striped area is counted for Pf all (p, γ) regarding




                                                                                                                                om
                                                                                                                           po t.c
                                                                   p
                                                                                                             R+,p , while all of them have distances larger than γ. Sim-
                                                                                                                         gs po
                                                                                                             ilar observation goes to R−,p . This limitation is caused
                                                                                                                       lo s
                                                                                                                     .b og
                                    Q . pcr ( 0 .4 )
                                                                                                             by the transformation, and cannot be remedied by in-
                                                                                                                   ts .bl


                                                                                                             creasing the number of P CRs. Our experiments also
                                                                                                                 ec ts



                                                              R
                                                                                                               oj c



                                                                  ,p
                                                                                                             confirm that the PCR technique cannot take advantage of
                                                                                                             pr oje




       Fig. 7. Transform query
                                                                                                           re r




                                                                                                             the large index space. This motivates us to develop new
                                                                                                         lo rep




                                                                                                             filtering technique to find a better trade-off between the
                                                                                                       xp lo




       PCR based Filtering technique. The PCR technique
                                                                                                     ee xp




                                                                                                             filtering capacity and pre-computation cost (i.e., index
       proposed in [26] cannot be directly applied for filtering
                                                                                                  .ie ee




                                                                                                             size).
                                                                                                 w e




       in Algorithm 1 because the range query studied in [26]
                                                                                                w .i
                                                                                               w w




       is a rectangular window and objects are uncertain. Nev-
                                                                                            :// w
                                                                                         tp //w




       ertheless we can adapt the PCR technique as follows.                                                  3.4 Anchor Points based Filter
                                                                                       ht ttp:




       As shown in Fig. 7, let Cp,γ represent the circle (sphere)
                                                                                        h




       centered at p with radius γ. Then we can regard the                                                   The anchor (pivot) point technique is widely employed
       uncertain query Q and Cp,γ as an uncertain object and                                                 in various applications, which aims to reduce the query
       the range query respectively. As suggested in [28], we                                                computation cost based on some pre-computed anchor
       can use R+,p (mbb of Cp,γ ) and R−,p (inner box) as                                                   (pivot) points. In this subsection, we investigate how to
       shown in Fig. 7 to prune and validate the point p based                                               apply anchor point technique to effectively and efficiently
       on the P CRs of Q respectively. For instance, if θ = 0.4                                              reduce the candidate set size. Following is a motivating
       the point p in Fig. 7 can be pruned according to case 2                                               example for the anchor point based filtering technique.
       of Theorem 1 because R1 does not intersect Q.pcr(0.4).                                                                                                      = 0.2
       Note that similar transformation can be applied for the                                                                               p1

       intermediate entries as well.                                                                                                                  p5
                                                                                                                                                                     p4
                                               R+ , p1        R+ , p3
                                               p1                                                                                       p2
                                                                        R+ , p5                                                                   o
                                                         p3                                                                                           d               p3
                                                                        p5

                                R+ , p2
                                                                                                                                Q.mbb                      Co ,d
                                                                             p4
                                          p2
                                                                             R+ , p4                         Fig. 9. Running example regarding the anchor point
                                      Q.mbb                   Q.pcr(0.2)                                     Motivating Example. Regarding our running example,
       Fig. 8. Running example                                                                               in Fig. 9 the shaded area, denoted by Co,d , is the circle
                                                                                                             centered at o with radius d. Suppose the probabilistic
       Example 5. Regarding the running example in Fig. 8,                                                   mass of Q in Co,d is 0.8, then when θ = 0.2 we can safely
       suppose Q.pcr(0.2) is pre-computed, then p1 , p2 and p4                                               prune p1 , p2 , p3 and p4 because Cpi ,γ does not intersect
       are pruned because R+,p1 , R+,p2 and R+,p4 do not overlap                                             Co,d for i = 1, 2, 3 and 4.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
                                                                                                                                                                        8


          In the paper, an anchor point a regarding the uncertain                         Ca,δ(a,p)−γ− ⊆ Ca,δ(a,p)+γ and Cp,γ ∩ Cp,δ(a,p)−γ− = ∅,
       query Q is a point in multidimensional space whose                                 this implies that Pf all (p, γ) ≤ Pf all (a, δ(a, p) + γ) −
       falling probability against different γ values are pre-                            Pf all (a, δ(a, p) − γ − ).
       computed. We can prune or validate a point based on its
       distance to the anchor point. For better filtering capability,                         Let LPf all (p, γ) and U Pf all (p, γ) denote the lower
       a set of anchor points will be employed.                                           and upper bounds derived from Lemma 2 regarding
          In the following part, Section 3.4.1 presents the anchor                        Pf all (p, γ). Then we can immediately validate a point p if
       point filtering technique. In Section 3.4.2, we investigate                         LPf all (p, γ) ≥ θ, or prune p if U Pf all (p, γ) < θ.
       how to construct anchor points for a given space budget,                              Clearly, it is infeasible to keep Pf all (a, l) for arbitrary
       followed by a time efficient filtering algorithm in Sec-                             l ≥ 0. Since Pf all (a, l) is a monotonic function with
       tion 3.4.3.                                                                        respect to l, we keep a set Da = {li } with size nd for each
                                                                                          anchor point such that Pf all (a, li ) = nid for 1 ≤ i ≤ nd .
       3.4.1     Anchor Point filtering technique (APF)                                    Then for any l > 0, we use U Pf all (a, l) and LPf all (a, l)
       For a given anchor point a regarding the uncertain query                           to represent the upper and lower bound of Pf all (a, l)
       Q, suppose Pf all (a, l) is pre-computed for arbitrary dis-                        respectively. Particularly, U Pf all (a, l) = Pf all (a, li ) where
       tance l. Lemma 2 provides lower and upper bounds                                   li is the smallest li ∈ Da such that li ≥ l. Similarly,
       of Pf all (p, γ) for any point p based on the triangle                             LPf all (a, l) = Pf all (a, lj ) where lj is the largest lj ∈ Da
       inequality. This implies we can prune or validate a point                          such that lj ≤ l. Then we have the following theorem by
       based on its distance to an anchor point.                                          rewriting Lemma 2 in a conservative way.
                                                                                          Theorem 3. Given an uncertain query Q and an anchor point
             Q                                                                   S2
                                                                                          a, for any rectangular region r and distance γ, we have:
                                                                 ε                           1) If γ > δmax (a, r), Pf all (r, γ) ≥ LPf all (a, γ −
                               a     p      γ                            γ                        δmax (a, r)).




                                                                                                    t.c om
                                                                                                       om
                                                          a                  p               2) Pf all (r, γ)        ≤      U Pf all (a, δmax (a, r) + γ)


                                                                                                  po t.c
                                                                                                  −LPf all (a, δmin (a, r) −γ − ) where is an arbitrarily
                                                                                                gs po
                                                                     Q
                               γ − δ (a , p )          S1                                     lo ssmall positive value.
                                                                                            .b og
                                                                                          ts .bl

                                                                                            Let LPf all (r, γ) and U Pf all (r, γ) represent the lower
                                                                                        ec ts


                    (a) Lower Bound                     (b) Upper Bound
                                                                                      oj c



                                                                                          and upper bounds of the falling probability derived from
                                                                                    pr oje




       Fig. 10. Lower and Upper Bound                                                     Theorem 3. We can safely prune (validate) an entry e if
                                                                                  re r
                                                                                lo rep




       Lemma 2. Let a denote an anchor point regarding the                                U Pf all (embb , γ) < θ (LPf all (embb , γ) ≥ θ). Recall that embb
                                                                              xp lo




                                                                                          represents the minimal bounding box of e. It takes O(d)
                                                                            ee xp




       uncertain query Q. For any point p ∈ S and a distance γ, we
                                                                                          time to compute δmax (a, embb ) and δmin (a, embb ). Mean-
                                                                         .ie ee




       have
                                                                        w e




                                                                                          while, the computation of LPf all (a, l) and U Pf all (a, l) for
                                                                       w .i




         1) If γ > δ(a, p), Pf all (p, γ) ≥ Pf all (a, γ − δ(a, p)).
                                                                      w w




                                                                                          any l > 0 costs O(log nd ) time because pre-computed
                                                                   :// w




         2) Pf all (p, γ) ≤ Pf all (a, δ(a, p) + γ) − Pf all (a, δ(a, p) −
                                                                tp //w




                                                                                          distance values in Da are sorted. Therefore, the filtering
             γ − ) where is an arbitrarily small positive value. 1
                                                              ht ttp:




                                                                                          time of each entry is O(d + log nd ) for each anchor point.
                                                               h




             Proof: Suppose γ > δ(a, p), then according to the tri-
       angle inequality for any x ∈ Q with δ(x, a) ≤ γ − δ(a, p),                         3.4.2 Heuristic with a finite number of anchor points
       we have δ(x, p) ≤ δ(a, p)+δ(x, a) ≤ δ(a, p)+(γ−δ(a, p)) =                          Let AP denote a set of anchor points for the uncertain
       γ. This implies that Pf all (p, γ) ≥ Pf all (a, γ − δ(a, p))                       query Q. We do not need to further process an entry e in
       according to Equation 2. Fig. 10(a) illustrates an example                         Algorithm 1 if it is filtered by any anchor point a ∈ AP.
       of the proof in 2 dimensional space. In Fig. 10(a), we have                        Intuitively, the more anchor points employed by Q, the
       Ca,γ−δ(a,p) ⊆ Cp,γ if γ > δ(a, p). Let S denote the striped                        more powerful the filter will be. However, we cannot
       area which is the intersection of Ca,γ−δ(a,p) and Q.                               employ a large number of anchor points due to the space
       Clearly, we have Pf all (a, γ − δ(a, p)) = x∈S Q.pdf (x)dx                         and filtering time limitations. Therefore, it is important
       and δ(x, p) ≤ γ for any x ∈ S. Consequently, Pf all (p, γ)                         to investigate how to choose a limited number of anchor
       ≥ Pf all (a, γ − δ(a, p)) holds.                                                   points such that the filter can work effectively.
          With similar rationale, for any x ∈ Q we have
       δ(x, a) ≤ δ(a, p) + γ if δ(x, p) ≤ γ. This implies                                 Anchor points construction. We first investigate how
       that Pf all (p, γ) ≤ Pf all (a, δ(a, p) + γ). Moreover, for                        to evaluate the “goodness” of an anchor point regard-
       any x ∈ Q with δ(x, a) ≤ δ(a, p) − γ − , we have                                   ing the computation of LPf all (p, γ). Suppose all anchor
       δ(x, a) > γ. Recall that represents an arbitrarily small                           points have the same falling probability functions; that
       constant value. This implies that x does not contribute                            is Pf all (ai , l) = Pf all (aj , l) for any two anchor points
       to Pf all (p, γ) if δ(x, a) ≤ δ(a, p) − γ − . Consequently,                        ai and aj . Then the closest anchor point regarding p
       Pf all (p, γ) ≤ Pf all (a, δ(a, p) + γ) − Pf all (a, δ(a, p) − γ − )               will provide the largest LPf all (p, γ). Since there is no a
       holds. As shown in Fig. 10(b), we have Pf all (p, γ) ≤                             priori knowledge about the distribution of the points, we
       Pf all (a, δ(a, p) + γ) because Cp,γ ⊆ Ca,δ(a,p)+γ . Since                         assume they follow the uniform distribution. Therefore,
                                                                                          anchor points should be uniformly distributed. If falling
          1. We have Pf all (a, δ(a, p) − γ − ) = 0 if δ(a, p) ≤ γ                        probabilistic functions of the anchor points are different,
Efficient computation of range aggregates
Efficient computation of range aggregates
Efficient computation of range aggregates
Efficient computation of range aggregates
Efficient computation of range aggregates
Efficient computation of range aggregates
Efficient computation of range aggregates

Contenu connexe

En vedette

On the quality of service of crash recovery
On the quality of service of crash recoveryOn the quality of service of crash recovery
On the quality of service of crash recoveryingenioustech
 
Dynamic measurement aware
Dynamic measurement awareDynamic measurement aware
Dynamic measurement awareingenioustech
 
Design and evaluation of a proxy cache for
Design and evaluation of a proxy cache forDesign and evaluation of a proxy cache for
Design and evaluation of a proxy cache foringenioustech
 
Peering equilibrium multi path routing
Peering equilibrium multi path routingPeering equilibrium multi path routing
Peering equilibrium multi path routingingenioustech
 
Impact of le arrivals and departures on buffer
Impact of  le arrivals and departures on bufferImpact of  le arrivals and departures on buffer
Impact of le arrivals and departures on bufferingenioustech
 
Applied research of e learning
Applied research of e learningApplied research of e learning
Applied research of e learningingenioustech
 

En vedette (7)

On the quality of service of crash recovery
On the quality of service of crash recoveryOn the quality of service of crash recovery
On the quality of service of crash recovery
 
Dynamic measurement aware
Dynamic measurement awareDynamic measurement aware
Dynamic measurement aware
 
Design and evaluation of a proxy cache for
Design and evaluation of a proxy cache forDesign and evaluation of a proxy cache for
Design and evaluation of a proxy cache for
 
Peering equilibrium multi path routing
Peering equilibrium multi path routingPeering equilibrium multi path routing
Peering equilibrium multi path routing
 
Impact of le arrivals and departures on buffer
Impact of  le arrivals and departures on bufferImpact of  le arrivals and departures on buffer
Impact of le arrivals and departures on buffer
 
Applied research of e learning
Applied research of e learningApplied research of e learning
Applied research of e learning
 
Intrution detection
Intrution detectionIntrution detection
Intrution detection
 

Plus de ingenioustech

Supporting efficient and scalable multicasting
Supporting efficient and scalable multicastingSupporting efficient and scalable multicasting
Supporting efficient and scalable multicastingingenioustech
 
Monitoring service systems from
Monitoring service systems fromMonitoring service systems from
Monitoring service systems fromingenioustech
 
Locally consistent concept factorization for
Locally consistent concept factorization forLocally consistent concept factorization for
Locally consistent concept factorization foringenioustech
 
Measurement and diagnosis of address
Measurement and diagnosis of addressMeasurement and diagnosis of address
Measurement and diagnosis of addressingenioustech
 
Exploiting dynamic resource allocation for
Exploiting dynamic resource allocation forExploiting dynamic resource allocation for
Exploiting dynamic resource allocation foringenioustech
 
Throughput optimization in
Throughput optimization inThroughput optimization in
Throughput optimization iningenioustech
 
Online social network
Online social networkOnline social network
Online social networkingenioustech
 
It auditing to assure a secure cloud computing
It auditing to assure a secure cloud computingIt auditing to assure a secure cloud computing
It auditing to assure a secure cloud computingingenioustech
 
Bayesian classifiers programmed in sql
Bayesian classifiers programmed in sqlBayesian classifiers programmed in sql
Bayesian classifiers programmed in sqlingenioustech
 
Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]
Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]
Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]ingenioustech
 
Active reranking for web image search
Active reranking for web image searchActive reranking for web image search
Active reranking for web image searchingenioustech
 
A dynamic performance-based_flow_control
A dynamic performance-based_flow_controlA dynamic performance-based_flow_control
A dynamic performance-based_flow_controlingenioustech
 
Java & dotnet titles
Java & dotnet titlesJava & dotnet titles
Java & dotnet titlesingenioustech
 

Plus de ingenioustech (18)

Supporting efficient and scalable multicasting
Supporting efficient and scalable multicastingSupporting efficient and scalable multicasting
Supporting efficient and scalable multicasting
 
Monitoring service systems from
Monitoring service systems fromMonitoring service systems from
Monitoring service systems from
 
Locally consistent concept factorization for
Locally consistent concept factorization forLocally consistent concept factorization for
Locally consistent concept factorization for
 
Measurement and diagnosis of address
Measurement and diagnosis of addressMeasurement and diagnosis of address
Measurement and diagnosis of address
 
Exploiting dynamic resource allocation for
Exploiting dynamic resource allocation forExploiting dynamic resource allocation for
Exploiting dynamic resource allocation for
 
Throughput optimization in
Throughput optimization inThroughput optimization in
Throughput optimization in
 
Tcp
TcpTcp
Tcp
 
Privacy preserving
Privacy preservingPrivacy preserving
Privacy preserving
 
Peace
PeacePeace
Peace
 
Online social network
Online social networkOnline social network
Online social network
 
Layered approach
Layered approachLayered approach
Layered approach
 
It auditing to assure a secure cloud computing
It auditing to assure a secure cloud computingIt auditing to assure a secure cloud computing
It auditing to assure a secure cloud computing
 
Bayesian classifiers programmed in sql
Bayesian classifiers programmed in sqlBayesian classifiers programmed in sql
Bayesian classifiers programmed in sql
 
Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]
Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]
Conditional%20 shortest%20path%20routing%20in%20delay%20tolerant%20networks[1]
 
Active reranking for web image search
Active reranking for web image searchActive reranking for web image search
Active reranking for web image search
 
A dynamic performance-based_flow_control
A dynamic performance-based_flow_controlA dynamic performance-based_flow_control
A dynamic performance-based_flow_control
 
Vebek
VebekVebek
Vebek
 
Java & dotnet titles
Java & dotnet titlesJava & dotnet titles
Java & dotnet titles
 

Dernier

Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxDr. Sarita Anand
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfNirmal Dwivedi
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Association for Project Management
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxAmanpreet Kaur
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Pooja Bhuva
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - Englishneillewis46
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfDr Vijay Vishwakarma
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSCeline George
 

Dernier (20)

Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 

Efficient computation of range aggregates

  • 1. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1 Efficient Computation of Range Aggregates against Uncertain Location Based Queries Ying Zhang1 Xuemin Lin1,2 Yufei Tao3 Wenjie Zhang1 Haixun Wang4 1 University of New South Wales, {yingz,lxue, zhangw}@cse.unsw.edu.au 2 NICTA 3 Chinese University of Hong Kong, taoyf@cse.cuhk.edu.hk 4 Microsoft Research Asia, haixunw@microsoft.com Abstract—In many applications, including location based services, queries may not be precise. In this paper, we study the problem of efficiently computing range aggregates in a multidimensional space when the query location is uncertain. Specifically, for a query point Q whose location is uncertain and a set S of points in a multi-dimensional space, we want to calculate the aggregate (e.g., count, average and sum) over the subset S of S such that for each p ∈ S , Q has at least probability θ within the distance γ to p. We propose novel, efficient techniques to solve the problem following the filtering-and-verification paradigm. In particular, two novel filtering techniques are proposed to effectively and efficiently remove data points from verification. Our comprehensive experiments based on both real and synthetic data demonstrate the efficiency and scalability of our techniques. Index Terms—Uncertainty, Index, Range aggregate query ✦ 1 I NTRODUCTION p5 will be destroyed. Similarly, objects p2 , p3 and p6 will be destroyed if the actual falling point is q2 . In this appli- t.c om om Query imprecision or uncertainty may be often caused cation, the risk of civilian casualties may be measured by po t.c by the nature of many applications, including location the total number n of civilian objects which are within γ gs po based services. The existing techniques for processing distance away from a possible blast point with at least lo s .b og location based spatial queries regarding certain query θ probability. Note that the probabilistic threshold is set ts .bl points and data points are not applicable or inefficient ec ts by the commander based on the levels of trade-off that oj c when uncertain queries are involved. In this paper, we pr oje she wants to make between the risk of civilian damages re r investigate the problem of efficiently computing distance and the effectiveness of military attacks; for instance, it is lo rep based range aggregates over certain data points and unlikely to cause civilian casualties if n = 0 with a small xp lo ee xp uncertain query points as described in the abstract. In θ. Moreover, different weight values may be assigned .ie ee general, an uncertain query Q is a multi-dimensional to these target points and hence the aggregate can be w e w .i point that might appear at any location x following conducted based on the sum of the values. w w :// w a probabilistic density function pdf (x) within a region tp //w Q.region. There is a number of applications where a ht ttp: p1 p3 query point may be uncertain. Below are two sample h γ q1 applications. γ a q2 Motivating Application 1. A blast warhead carried by p5 p2 p6 a missile may destroy things by blast pressure waves in Q its lethal area where the lethal area is typically a circular p4 p7 area centered at the point of explosion (blast point) with Q : s h a d o w e d re g i o n t o i n d i c a t e t h e p o s s i b l e l o c a t i o n s o f t h e q u e ry radius γ [24] and γ depends on the explosive used. q1, q 2 : to i n d i c a te tw o p o s s i b l e l o c a ti o n s o f Q While firing such a missile, even the most advanced γ : q u e ry d i s t a n c e laser-guided missile cannot exactly hit the aiming point with 100% guarantee. The actual falling point (blast Fig. 1. Missile Example point) of a missile blast warhead regarding a target point usually follows some probability density functions Motivating Application 2. Similarly, we can also esti- (P DF s); different P DF s have been studied in [24] where mate the effectiveness of a police vehicle patrol route bivariate normal distribution is the simplest and the most using range aggregate against uncertain location based common one [24]. In military applications, firing such query Q. For example, Q in Fig. 1 now corresponds a missile may not only destroy military targets but may to the possible locations of a police patrol vehicle in a also damage civilian objects. Therefore, it is important to patrol route. A spot (e.g., restaurant, hotel, residential avoid the civilian casualties by estimating the likelihood property), represented by a point in {p1 , p2 , . . . , p7 } in of damaging civilian objects once the aiming point of a Fig. 1, is likely under reliable police patrol coverage [11] blast missile is determined. As depicted in Fig. 1, points if it has at least θ probability within γ distance to a {pi } for 1 ≤ i ≤ 7 represent some civilian objects (e.g., moving patrol vehicle, where γ and θ are set by domain residential buildings, public facilities ). If q1 in Fig. 1 is experts. The number of spots under reliable police patrol the actual falling point of the missile, then objects p1 and coverage is often deployed to evaluate the effectiveness Digital Object Indentifier 10.1109/TKDE.2011.46 1041-4347/11/$26.00 © 2011 IEEE
  • 2. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 2 of the police patrol route. tion to be stored. Both of them can be applied to Motivated by the above applications, in the paper we continuous case and discrete case. study the problem of aggregate computation against the • Extensive experiments are conducted to demon- data points which have at least probability θ to be within strate the efficiency of our techniques. distance γ regarding an uncertain location based query. • While we focus on the problem of range counting for uncertain location based queries in the paper, our Challenges. A naive way to solve this problem is that techniques can be immediately extended to other for each data point p ∈ S, we calculate the probability, range aggregates. namely falling probability, of Q within γ distance to p, The remainder of the paper is organized as follows. select p against a given probability threshold, and then Section 2 formally defines the problem and presents conduct the aggregate. This involves the calculation of preliminaries. In Section 3, following the filtering-and- an integral regarding each p and Q.pdf for each p ∈ S; verification framework, we propose three filtering tech- unless Q.pdf has a very simple distribution (e.g., uniform niques. Section 4 evaluates the proposed techniques with distributions), such a calculation may often be very ex- extensive experiments. Then some possible extensions of pensive and the naive method may be computationally our techniques are discussed in Section 5. This is fol- prohibitive when a large number of data points is in- lowed by related work in Section 6. Section 7 concludes volved. In the paper we target the problem of efficiently the paper. computing range aggregates against an uncertain Q for arbitrary Q.pdf and Q.region. Note that when Q.pdf is a uniform distribution within a circular region Q.region, 2 BACKGROUND I NFORMATION a circular “window” can be immediately obtained ac- We first formally define the problem in Section 2.1, then cording to γ and Q.region so that the computation of Section 2.2 presents the PCR technique [26] which is em- range aggregates can be conducted via the window ployed in the filtering technique proposed in Section 3.3. aggregates [27] over S. t.c om Notation Definition om po t.c Q uncertain location based query Contributions. Our techniques are developed based on gs po S a set of points the standard filtering-and-verification paradigm. We first lo s q instance of an uncertain query Q .b og discuss how to apply the existing probabilistically con- d dimensionality ts .bl Pq the probability of the q to appear strained regions (PCR) technique [26] to our problem. ec ts θ and γ probabilistic threshold and query distance oj c Then, we propose two novel distance based filtering pr oje Pf all (Q, p, γ) the falling probability of p regarding techniques, statistical filtering (STF) and anchor point re r Q and γ lo rep filtering (APF), respectively, to address the inherent lim- Qθ,γ (S) {p|p ∈ S ∧ Pf all (Q, p, γ) ≥ θ} xp lo p, x, y, b(S) point (a set of data points) ee xp its of the PCR technique. The basic idea of the STF e R tree entry .ie ee technique is to bound the falling probability of the points Cp,r a circle(sphere) centred at p with radius r w e w .i by applying some well known statistical inequalities δ(x, y) the distance between x and y w w δmax(min) (r1 , r2 ) :// w where only a small amount of statistic information about the maximal(minimal) distance tp //w between two rectangular regions the uncertain location based query Q is required. The ht ttp: gQ mean of Q STF technique is simple and space efficient (only d + 2 ηQ weighted average distance of Q h float numbers required where d denotes the dimension- σQ variance of Q arbitrarily small positive constant value ality), and experiments show that it is effective. For the a anchor point scenarios where a considerable “large” space is available, nap the number of anchor points we propose a view based filter which consists of a set of LPf all (p, γ) lower bound of the Pf all (p, γ) anchor points. An anchor point may reside at any location U Pf all (p, γ) upper bound of the Pf all (p, γ) nd the number of different distances and its falling probability regarding Q is pre-computed pre-computed for each anchor point for several γ values. Then many data points might be Da a set of distance values used by effectively filtered based on their distances to the anchor anchor point a points. For a given space budget, we investigate how to TABLE 1 construct the anchor points and their accessing orders. The summary of notations. To the best of our knowledge, we are the first to identify the problem of computing range aggregates against uncertain location based query. In this paper, we 2.1 Problem Definition investigate the problem regarding both continuous and In the paper, S is a set of points in a d-dimensional discrete Q. Our principle contributions can be summa- numerical space. The distance between two points x and rized as follows. y is denoted by δ(x, y). Note that techniques developed in the paper can be applied to any distance metrics [5]. • We propose two novel filtering techniques, STF and In the examples and experiments, the Euclidean distance APF, respectively. The STF technique has a decent is used. For two rectangular regions r1 and r2 , we have filtering power and only requires the storage of very δmax (r1 , r2 ) = max∀x∈r1 ,y∈r2 δ(x, y) and limited pre-computed information. APF provides the flexibility to significantly enhance the filtering 0 if r1 ∩ r2 = ∅ δmin (r1 , r2 ) = (1) power by demanding more pre-computed informa- min∀x∈r1 ,y∈r2 δ(x, y) otherwise
  • 3. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 3 An uncertain (location based) query Q may be de- max, sum, avg, etc., over some non-locational attributes scribed by a continuous or a discrete distribution as fol- (e.g., weight value of the object in missile example). lows. Example 1. In Fig. 2, S = {p1 , p2 , p3 } and Q = {q1 , q2 , q3 } Definition 1 (Continuous Distribution). An uncertain where Pq1 = 0.4, Pq2 = 0.3 and Pq3 = 0.3. According query Q is described by a probabilistic density function Q.pdf . to Definition 3, for the given γ, we have Pf all (p1 , γ) = Let Q.region represent the region where Q might appear, then 0.4, Pf all (p2 , γ) = 1, and Pf all (p3 , γ) = 0.6. Therefore, x∈Q.region Q.pdf (x)dx = 1; Qθ,γ (S) = {p2 , p3 } if θ is set to 0.5, and hence |Qθ,γ (S)| = 2. Definition 2 (Discrete Distribution). An uncertain query Q consists of a set of instances (points) {q1 , q2 , . . . , qn } in a d- 2.2 Probabilistically Constrained Regions (PCR) dimensional numerical space where qi appears with probability Pqi , and q∈Q Pq = 1; In [26], Tao et al. study the problem of range query on uncertain objects, in which the query is a rectangular Note that, in Section 5 we also cover the applications window and the location of each object is uncertain. where Q can have a non-zero probability to be absent; Although the problem studied in [26] is different with that is, x∈Q.region Q.pdf (x)dx = c or q∈Q Pq = c for a the one in this paper, in Section 3.3 we show how c < 1. to modify the techniques developed in [26] to support For a point p, we use Pf all (Q, p, γ) to represent the uncertain location based query. probability of Q within γ distance to p, called falling In the following part, we briefly introduce the Prob- probability of p regarding Q and γ. It is formally defined abilistically Constrained Region (PCR) technique devel- below. oped in [26]. Same as the uncertain location based query, For continuous cases, an uncertain object U is modeled by a probability density function U.pdf (x) and an uncertain region U.region. Pf all (Q, p, γ) = Q.pdf (x)dx (2) x∈Q.region ∧ δ(x,p)≤γ The probability that the uncertain object U falls in the rectangular window query rq , denoted by Pf all (U, rq ), t.c om om For discrete cases, is defined as x∈U.region∩rq U.pdf (x)dx. In [26], the prob- po t.c gs po Pf all (Q, p, γ) = Pq (3) abilistically constrained region of the uncertain object lo s .b og q∈Q ∧ δ(q,p)≤γ U regarding probability θ (0 ≤ θ ≤ 0.5), denoted by ts .bl U.pcr(θ), is employed in the filtering technique. Partic- ec ts In the paper hereafter, Pf all (Q, p, γ) is abbreviated to oj c ularly, U.pcr(θ) is a rectangular region constructed as pr oje Pf all (p, γ), and Q.region and Q.pdf are abbreviated to Q follows. re r lo rep and pdf respectively, whenever there is no ambiguity. It For each dimension i, the projection of U.pcr(θ) xp lo is immediate that Pf all (p, γ) is a monotonically increas- [U.pcri− (θ), U.pcri+ (θ)] ee xp is denoted by where .ie ee ing function with respect to distance γ. x∈U.region&xi ≤U.pcri− (θ) U.pdf (x)dx = θ and w e w .i x∈U.region&xi ≥U.pcri+ (θ) U.pdf (x)dx = θ. Note that w w γ :// w γ xi represents the coordinate value of the point x tp //w p1 q1 on i-th dimension. Then U.pcr(θ) corresponds to ht ttp: Q q2 p3 a rectangular region [U.pcr− (θ), U.pcr+ (θ)] where h p2 q3 U.pcr− (θ) (U.pcr+ (θ)) is the lower (upper) corner and the coordinate value of U.pcr− (θ) (U.pcr+ (θ)) on i-th γ dimension is U.pcri− (θ) (U.pcri+ (θ)). Fig. 3(a) illustrates the U.pcr(0.2) of the uncertain object U in 2 dimensional space. Therefore, the probability mass of U on the left Fig. 2. Example of Pf all (Q, p, γ) (right) side of l1− (l1+ ) is 0.2 and the probability mass of Problem Statement. U below (above) the l2− (l2+ ) is 0.2 as well. Following In many applications, users are only interested in the is a motivating example of how to derive the lower and points with falling probabilities exceeding a given prob- upper bounds of the falling probability based on PCR. abilistic threshold regarding Q and γ. In this paper we Example 2. According to the definition of PCR, in Fig. 3(b) investigate the problem of probabilistic threshold based the probabilistic mass of U in the shaded area is 0.2, i.e., uncertain location range aggregate query on points data; x∈U.region&x1 ≥U.pcr1+ (θ) U.pdf (x)dx = 0.2. Then, it is im- it is formally defined below. mediate that Pf all (U, rq1 ) < 0.2 because rq1 does not intersect Definition 3. [Uncertain Range Aggregate Query] Given U.pcr(0.2). Similarly, we have Pf all (U, rq2 ) ≥ 0.2 because the a set S of points, an uncertain query Q, a query distance shaded area is enclosed by rq2 . γ and a probabilistic threshold θ, we want to compute an The following theorem [26] formally introduces how aggregate function (e.g., count, avg, and sum) against points to prune or validate an uncertain object U based on p ∈ Qθ,γ (S), where Qθ,γ (S) denotes a subset of points U.pcr(θ) or U.pcr(1 − θ). Note that we say an uncertain {p} ⊆ S such that Pf all (p, γ) ≥ θ. object is pruned (validated) if we can claim Pf all (U, rq ) < θ In this paper, our techniques will be presented based (Pf all (U, rq ) ≥ θ) based on the P CR. on the aggregate count. Nevertheless, they can be imme- Theorem 1. Given an uncertain object U , a range query rq diately extended to cover other aggregates, such as min, (rq is a rectangular window) and a probabilistic threshold θ.
  • 4. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 4 shows that techniques proposed in this section can be U region immediately applied to the discrete case. rq 2 l 2+ 3.1 A framework for filtering-and-verification Algo- rithm l 2- In this subsection, following the filtering-and-verification paradigm we present a general framework to support l 1+ uncertain range aggregate query based on the filtering l 1- rq1 technique. To facilitate the aggregate query computation, U .pcr(0.2) U .pcr(0.2) l 1+ we assume a set S of points is organized by an aggregate (a) PCR (b) PCR based filtering R-Tree [22], denoted by RS . Note that an entry e of RS might be a data entry or an intermediate entry Fig. 3. A 2d probabilistically constrained region (PCR (0.2)) where a data entry corresponds to a point in S and an 1) For θ > 0.5, U can be pruned if rq does not fully contain intermediate entry groups a set of data entries or child U.pcr(1 − θ); intermediate entries. Assume a filter, denoted by F , is 2) For θ ≤ 0.5, the pruning condition is that rq does not available to prune or validate a data entry (i.e., a point) intersect U.pcr(θ); or an intermediate entry (i.e., a set of points). 3) For θ > 0.5, the validating criterion is that rq com- Algorithm 1 illustrates the framework of the filtering- pletely contains the part of Umbb on the right (left) of and-verification Algorithm. Note that details of the fil- plane U.pcri− (1−θ) (U.pcri+ (1−θ)) for some i ∈ [1, d], tering techniques will be introduced in the following where Umbb is the minimal bounding box of uncertain subsections. The algorithm consists of two phases. In the region U.region; filtering phase (Line 3-16), for each entry e of RS to be t.c om 4) For θ ≤ 0.5 the validating criterion is that rq completely processed, we do not need to further process e if it is om po t.c contains the part of Umbb on the left (right) of plane pruned or validated by the filter F . We say an entry e is gs po U.pcri− (θ) (U.pcri+ (θ)) for some i ∈ [1, d]; pruned (validated) if the filter can claim Pf all (p, γ) < θ lo s .b og (Pf all (p, γ) ≥ θ) for any point p within embb . The counter ts .bl 3 Filtering-and-Verification A LGORITHM cn is increased by |e| (Line 6) if e is validated where ec ts oj c |e| denotes the aggregate value of e (i.e., the number pr oje According to the definition of falling probability (i.e., of data points in e). Otherwise, the point p associated re r Pf all (p, γ)) in Equation 2, the computation involves in- lo rep with e is a candidate point if e corresponds to a data xp lo tegral calculation, which may be expensive in terms of entry (Line 10), and all child entries of e are put into the ee xp CPU cost. Based on Definition 3, we only need to know .ie ee queue for further processing if e is an intermediate entry whether or not the falling probability of a particular point w e (Line 12). The filtering phase terminates when the queue w .i w w regarding Q and γ exceeds the probabilistic threshold :// w is empty. In the verification phase (Line 17-21), candidate tp //w for the uncertain aggregate range query. This motivates points are verified by the integral calculations according ht ttp: us to follow the filtering-and-verification paradigm for the to Equation 2. h uncertain aggregate query computation. Particularly, in the filtering phase, effective and efficient filtering tech- Cost Analysis. The total time cost of Algorithm 1 is as niques will be applied to prune or validate the points. We follows. say a point p is pruned (validated) regarding the uncertain Cost = Nf × Cf + Nio × Cio + Nca × Cvf (4) query Q, distance γ and probabilistic threshold θ if we can claim that Pf all (p, γ) < θ ( Pf all (p, γ) ≥ θ ) based on Particularly, Nf represents the number of entries being the filtering techniques without explicitly computing the tested by the filter on Line 5 and Cf is the time cost Pf all (p, γ). The points that cannot be pruned or validated for each test. Nio denotes the number of nodes (pages) will be verified in the verification phase in which their accessed (Line 13) and Cio corresponds to the delay of falling probabilities are calculated. Therefore, it is desirable each node (page) access of RS . Nca represents the size to develop effective and efficient filtering techniques to of candidate set C and Cvf is the computation cost for prune or validate points such that the number of points each verification (Line 15) in which numerical integral being verified can be significantly reduced. computation is required. With a reasonable filtering time In this section, we first present a general framework cost (i.e., Cvf ), the dominant cost of Algorithm 1 is for the filtering-and-verification Algorithm based on fil- determined by Nio and Nca because Cio and Cvf might tering techniques in Section 3.1. Then a set of filtering be expensive. Therefore, in the paper we aim to develop techniques are proposed. Particularly, Section 3.2 pro- effective and efficient filtering techniques to reduce Nca poses the statistical filtering technique. Then we investi- and Nio . gate how to apply the PCR based filtering technique in Filtering. Suppose there is no filter F in Algorithm 1, Section 3.3. Section 3.4 presents the anchor point based all points in S will be verified. Regarding the example filtering technique. in Fig. 4, 5 points p1 , p2 , p3 , p4 and p5 will be veri- For presentation simplicity, we consider the continuous fied. A straitforward filtering technique is based on the case of the uncertain query in this section. Section 3.5 minimal and maximal distances between the minimal
  • 5. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 5 Algorithm 1 Filtering-and-Verification(RS , Q, F , γ, θ) in Algorithm 1 is significantly increased. In the following subsections, we present three filtering techniques, named Input: RS : an aggregate R tree on data set S, STF, PCR and APF respectively, which can significantly Q : uncertain query, F : Filter, γ : query distance, enhance the filtering capability of the filter. θ : probabilistic threshold. Output: |Qθ,γ (S)| 3.2 Statistical Filter Description: 1: Queue := ∅; cn := 0; C := ∅; In this subsection, we propose a statistical filtering tech- 2: Insert root of RS into Queue; nique, namely STF. After introducing the motivation 3: while Queue = ∅ do of the technique, we present some important statistic 4: e ← dequeue from the Queue; information of the uncertain query and then show how 5: if e is validated by the filter F then to derive the lower and upper bounds of the falling 6: cn := cn + |e|; probability of a point regarding an uncertain query Q, 7: else distance γ and probabilistic threshold θ. 8: if e is not pruned by the filter F then Motivation. As shown in Fig. 5, given an uncertain 9: if e is data entry then query Q1 and γ we cannot prune point p based on the 10: C := C ∪ p where p is the data point e MMD technique, regardless of the value of θ, although represented; intuitively the falling probability of p regarding Q1 is 11: else likely to be small. Similarly, we cannot validate p for Q2 . 12: put all child entries of e into Queue; This motivates us to develop a new filtering technique 13: end if which is as simple as MMD, but can exploit θ to en- 14: end if hance the filtering capability. In the following part, we 15: end if show that lower and upper bounds of Pf all (p, γ) can be 16: end while derived based on some statistics of the uncertain query. t.c om om 17: for each point p ∈ C do Then a point may be immediately pruned (validated) po t.c 18: if Pf all (Q, p, γ) ≥ θ then based on the upper(lower) bound of Pf all (p, γ), denoted gs po 19: cn := cn + 1; by U Pf all (p, γ) (LPf all (p, γ)). lo s .b og 20: end if Example 4. In Fig. 5 suppose θ = 0.5 and we have ts .bl U Pf all (Q1 , p, γ) = 0.4 (LPf all (Q2 , p, γ) = 0.6) based on ec ts 21: end for oj c pr oje 22: Return cn the statistical bounds, then p can be safely pruned (vali- re r dated) without explicitly computing its falling probability lo rep bounding boxes(MBBs) of an entry and the uncertain regarding Q1 (Q2 ). Regarding the running example in Fig.4, xp lo query. Clearly, for any θ we can safely prune an entry if ee xp suppose θ = 0.2 and we have U Pf all (p2 , γ) = 0.15 while δmin (Qmbb , embb ) > γ or validate it if δmax (Qmbb , embb ) ≤ .ie ee U Pf all (pi , γ) ≥ 0.2 for 3 ≤ i ≤ 5, then p2 is pruned. There- w e γ. We refer this as maximal/minimal distance based filter- w .i fore, three points (p3 , p4 and p5 ) are verified in Algorithm 1 w w ing technique, namely MMD. MMD technique is time ef- :// w when MMD and statistical filtering techniques are applied. tp //w ficient as it takes only O(d) time to compute the minimal ht ttp: and maximal distances between Qmbb and embb . Recall Q1 γ h that Qmbb is the minimal bounding box of Q.region. Q2 g Q1 p g Q2 p1 p5 p4 Fig. 5. Motivation Example p2 Statistics of the uncertain query p3 To apply the statistical filtering technique, the follow- ing statistics of the uncertain query Q are pre-computed. Q : u n c er t ai n l oc at i on b as ed r an g e q u er y Definition 4 (mean (gQ )). gQ = x∈Q x × Q.pdf (x)dx. Fig. 4. Running Example Definition 5 (weighted average distance (ηQ )). ηQ equals Example 3. As shown in Fig. 4, suppose the MMD filtering x∈Q δ(x, gQ ) × Q.pdf (x)dx technique is applied in Algorithm 1, then p1 is pruned and Definition 6 (variance (σQ )). σQ equals x∈Q δ(x, gQ )2 × the other 4 points p2 , p3 , p4 and p5 will be verified. Q.pdf (x)dx Although the MMD technique is very time efficient, its Derive lower and upper bounds of Pf all (p, γ). filtering capacity is limited because it does not make use For a point p ∈ S, the following theorem shows how of the distribution information of the uncertain query Q to derive the lower and upper bounds of Pf all (p, γ) and the probabilistic threshold θ. This motivates us to de- based on above statistics of Q. Then, without explicitly velop more effective filtering techniques based on some computing Pf all (p, γ), we may prune or validate the point pre-computations on the uncertain query Q such that the p based on U Pf all (p, γ) and LPf all (p, γ) derived based on number of entries (i.e., points ) being pruned or validated the statistics of Q.
  • 6. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 6 Theorem 2. Given an uncertain query Q and a distance γ, σ = V ar(Y ), according to Lemma 1 when γ > μ we and suppose the mean gQ , weighted average distance ηQ and have variance σQ of Q are available. Then for a point p, we have 1 1 P r(Y ≤ γ) ≤ P (Y ≥ γ ) ≤ (6) 1) If γ > μ1 , Pf all (p, γ) ≥ 1 − (γ−μ1 )2 , where μ1 = (γ −μ )2 1+ σ2 1+ 2 σ1 2 2 δ(gQ , p) + ηQ and σ1 = σQ − ηQ + 4ηQ × δ(gQ , p). Because values of μ, σ 2 , μ and σ 2 may change re- 1 2) If γ < δ(gQ , p) − ηQ − , Pf all (p, γ) ≤ (γ −μ2 )2 , garding different point p ∈ S, we cannot pre-compute 1+ 2 σ2 them. Nevertheless, in the following part we show that where μ2 2 =Δ+ 2 = σQ − ηQ , σ2 2 + 4ηQ × Δ, Δ = ηQ their upper bounds can be derived based on the statistic γ + γ + − δ(p, gQ ) and γ > 0. The represents an information of the Q, which can be pre-computed based arbitrarily small positive constant value. on the probabilistic distribution of Q. Before the proof of Theorem 2, we first introduce the p Cantelli’s Inequality [19] described by Lemma 1 which is Q γ one-sided version of the Chebyshev Inequality. Lemma 1. Let X be an univariate random variable with the expected value μ and the finite variance σ 2 . Then for any 1 C > 0, P r(X − μ ≥ C × σ) ≤ 1+C 2 . gQ ε Following is the proof of Theorem 2. Proof: Intuition of the Proof. For a given point p ∈ S, p' γ' its distance to Q can be regarded as an univariate random variable Y , and we have Pf all (p, γ) = P r(Y ≤ γ). Given Fig. 6. Proof of Upper bound γ, we can derive the lower and upper bounds of P r(Y ≤ Based on the triangle inequality, for any x ∈ Q we γ) (Pf all (p, γ)) based on the statistical inequality in have δ(x, p) ≤ δ(x, gQ )+δ(p, gQ ) and δ(x, p) ≥ | δ(x, gQ )− t.c om om Lemma 1 if the expectation (E(Y )) and variance(V ar(Y )) δ(p, gQ ) | for any x ∈ Q. Then we have po t.c of the random variable Y are available. Although E(Y ) and V ar(Y ) take different values regarding different μ =gs po y × Y.pdf (y)dy = δ(x, p) × pdf (x)dx lo s .b og points, we show that the upper bounds of E(Y ) and y∈Y x∈Q ts .bl ec ts V ar(Y ) can be derived based on mean(gQ ), weighted ≤ (δ(p, gQ ) + δ(x, gQ )) × pdf (x)dx oj c pr oje average distance (ηQ ) and variance(σQ ) of the query Q. x∈Q re r ≤ δ(gQ , p) + ηQ = μ1 lo rep Then, the correctness of the theorem follows. xp lo Details of the Proof. The uncertain query Q is a ran- ee xp and dom variable which equals x ∈ Q.region with prob- .ie ee ability Q.pdf (x). For a given point p, let Y denote σ2 = E(Y 2 ) − E 2 (Y ) w e w .i w w the distance distribution between p and Q; that is, :// w Y is an univariate random variable and Y.pdf (l) = ≤ (δ(gQ , p) + δ(x, gQ ))2 pdf (x)dx tp //w x∈Q ht ttp: x∈Q.region and δ(x,p)=l Q.pdf (x)dx for any l ≥ 0. Conse- −(δ(gQ , p) − ηQ )2 h quently, we have Pf all (p, γ) = P r(Y ≤ γ) according to Equation 2. Let μ = E(Y ), σ 2 = V ar(Y ) and C = γ−μ , σ = 2 δ(gQ , p) × δ(x, gQ ) × pdf (x)dx then based on lemma 1, if γ > μ we have x∈Q 1 + δ(x, gQ )2 × pdf (x)dx P r(Y ≥ γ) = P r(Y − μ ≥ C × σ) ≤ 1+ ( γ−μ )2 σ x∈Q 2 +2 × δ(gQ , p) × ηQ − ηQ 2 2 Then it is immediate that = σQ − ηQ + 4ηQ × δ(gQ , p) = σ1 1 Together with Inequality 5, we have P r(Y ≤ γ) ≥ P r(Y ≤ γ) ≥ 1 − P r(Y ≥ γ) ≥ 1 − (γ−μ)2 (5) 1+ 1 1 σ2 1− (γ−μ)2 ≥ 1− (γ−μ1 )2 if μ1 < γ. With similar 1+ σ2 1+ 2 σ1 According to Inequation 5 we can derive the lower rationale, let Δ = δ(gQ , p ) = γ + γ + − δ(p, gQ ) we bound of Pf all (p, γ). Next, we show how to derive have μ ≥ Δ + ηQ = μ2 and σ 2 ≤ σQ − ηQ + 4ηQ × 2 upper bound of Pf all (p, γ). As illustrated in Fig. 6, 2 Δ = σ2 . Based on Inequality 6, we have P r(Y ≤ γ) ≤ let p denote a dummy point on the line pgQ with 1 1 (γ −μ )2 ≤ (γ −μ2 )2 if γ < δ(gQ , p)− ηQ − . Therefore, δ(p , p) = γ + γ + where is an arbitrarily small 1+ σ 2 1+ 2 σ2 positive constant value. Similar to the definition of the correctness of the theorem follows. Y , let Y be the distance distribution between p and The following extension is immediate based on the Q; that is, Y is an univariate random variable where similar rationale of Theorem 2. Y .pdf (l) = x∈Q.region and δ(x,p )=l Q.pdf (x)dx for any Extension 1. Suppose r is a rectangular region, we can l ≥ 0. Then, as shown in Fig. 6, for any point x ∈ Q use δmin (r, gQ ) and δmax (r, gQ ) to replace δ(p, gQ ) in with δ(x, p ) ≤ γ (shaded area), we have δ(x, p) > γ. This Theorem 2 for lower and upper probabilistic bounds implies that P (Y ≤ γ) ≤ P (Y ≥ γ ). Let μ = E(Y ) and computation respectively.
  • 7. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 7 Based on Extension 1, we can compute the upper Q.pcr(0.2) according to Theorem 1 in Section 2.2. Conse- and lower bounds of Pf all (embb , γ) where embb is the quently, only p3 and p5 go to the verification phase when minimal bounding box of the entry e, and hence prune Q.pcr(0.2) is available. or validate e in Algorithm 1. Since gQ , ηQ and σQ are Same as [26], [28], a finite number of P CRs are pre- pre-computed, the dominant cost in filtering phase is computed for the uncertain query Q regarding different the distance computation between embb and gQ which probability values. For a given θ at query time, if the is O(d). Q.pcr(θ) is not pre-computed we can choose two pre- computed P CRs Q.pcr(θ1 ) and Q.pcr(θ2 ) where θ1 (θ2 ) is the largest (smallest) existing probability value smaller 3.3 PCR based Filter (larger) than θ. We can apply the modified PCR tech- Motivation. Although the statistical filtering technique nique as the filter in Algorithm 1, and the filtering time can significantly reduce the candidate size in Algo- regarding each entry tested is O(m+log(m)) in the worst rithm 1, the filtering capacity is inherently limited be- case , where m is the number of P CRs pre-computed by cause only a small amount of statistics are employed. the filter. This motivates us to develop more sophisticated filtering The PCR technique can significantly enhance the filter- techniques to further improve the filtering capacity; that ing capacity when a particular number of PCR s are pre- is, we aim to improve the filtering capacity with more computed. The key of the PCR filtering technique is to pre-computations (i.e., more information kept for the partition the uncertain query along each dimension. This filter). In this subsection, the PCR technique proposed may inherently limit the filtering capacity of the PCR in [26] will be modified for this purpose. based filtering technique. As shown in Fig. 7, we have to use two rectangular regions for pruning and validation R+ , p C p, purpose, and hence the Cp,γ is enlarged (shrunk) during Q region the computation. As illustrated in Fig. 7, all instances of t.c om Q in the striped area is counted for Pf all (p, γ) regarding om po t.c p R+,p , while all of them have distances larger than γ. Sim- gs po ilar observation goes to R−,p . This limitation is caused lo s .b og Q . pcr ( 0 .4 ) by the transformation, and cannot be remedied by in- ts .bl creasing the number of P CRs. Our experiments also ec ts R oj c ,p confirm that the PCR technique cannot take advantage of pr oje Fig. 7. Transform query re r the large index space. This motivates us to develop new lo rep filtering technique to find a better trade-off between the xp lo PCR based Filtering technique. The PCR technique ee xp filtering capacity and pre-computation cost (i.e., index proposed in [26] cannot be directly applied for filtering .ie ee size). w e in Algorithm 1 because the range query studied in [26] w .i w w is a rectangular window and objects are uncertain. Nev- :// w tp //w ertheless we can adapt the PCR technique as follows. 3.4 Anchor Points based Filter ht ttp: As shown in Fig. 7, let Cp,γ represent the circle (sphere) h centered at p with radius γ. Then we can regard the The anchor (pivot) point technique is widely employed uncertain query Q and Cp,γ as an uncertain object and in various applications, which aims to reduce the query the range query respectively. As suggested in [28], we computation cost based on some pre-computed anchor can use R+,p (mbb of Cp,γ ) and R−,p (inner box) as (pivot) points. In this subsection, we investigate how to shown in Fig. 7 to prune and validate the point p based apply anchor point technique to effectively and efficiently on the P CRs of Q respectively. For instance, if θ = 0.4 reduce the candidate set size. Following is a motivating the point p in Fig. 7 can be pruned according to case 2 example for the anchor point based filtering technique. of Theorem 1 because R1 does not intersect Q.pcr(0.4). = 0.2 Note that similar transformation can be applied for the p1 intermediate entries as well. p5 p4 R+ , p1 R+ , p3 p1 p2 R+ , p5 o p3 d p3 p5 R+ , p2 Q.mbb Co ,d p4 p2 R+ , p4 Fig. 9. Running example regarding the anchor point Q.mbb Q.pcr(0.2) Motivating Example. Regarding our running example, Fig. 8. Running example in Fig. 9 the shaded area, denoted by Co,d , is the circle centered at o with radius d. Suppose the probabilistic Example 5. Regarding the running example in Fig. 8, mass of Q in Co,d is 0.8, then when θ = 0.2 we can safely suppose Q.pcr(0.2) is pre-computed, then p1 , p2 and p4 prune p1 , p2 , p3 and p4 because Cpi ,γ does not intersect are pruned because R+,p1 , R+,p2 and R+,p4 do not overlap Co,d for i = 1, 2, 3 and 4.
  • 8. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 8 In the paper, an anchor point a regarding the uncertain Ca,δ(a,p)−γ− ⊆ Ca,δ(a,p)+γ and Cp,γ ∩ Cp,δ(a,p)−γ− = ∅, query Q is a point in multidimensional space whose this implies that Pf all (p, γ) ≤ Pf all (a, δ(a, p) + γ) − falling probability against different γ values are pre- Pf all (a, δ(a, p) − γ − ). computed. We can prune or validate a point based on its distance to the anchor point. For better filtering capability, Let LPf all (p, γ) and U Pf all (p, γ) denote the lower a set of anchor points will be employed. and upper bounds derived from Lemma 2 regarding In the following part, Section 3.4.1 presents the anchor Pf all (p, γ). Then we can immediately validate a point p if point filtering technique. In Section 3.4.2, we investigate LPf all (p, γ) ≥ θ, or prune p if U Pf all (p, γ) < θ. how to construct anchor points for a given space budget, Clearly, it is infeasible to keep Pf all (a, l) for arbitrary followed by a time efficient filtering algorithm in Sec- l ≥ 0. Since Pf all (a, l) is a monotonic function with tion 3.4.3. respect to l, we keep a set Da = {li } with size nd for each anchor point such that Pf all (a, li ) = nid for 1 ≤ i ≤ nd . 3.4.1 Anchor Point filtering technique (APF) Then for any l > 0, we use U Pf all (a, l) and LPf all (a, l) For a given anchor point a regarding the uncertain query to represent the upper and lower bound of Pf all (a, l) Q, suppose Pf all (a, l) is pre-computed for arbitrary dis- respectively. Particularly, U Pf all (a, l) = Pf all (a, li ) where tance l. Lemma 2 provides lower and upper bounds li is the smallest li ∈ Da such that li ≥ l. Similarly, of Pf all (p, γ) for any point p based on the triangle LPf all (a, l) = Pf all (a, lj ) where lj is the largest lj ∈ Da inequality. This implies we can prune or validate a point such that lj ≤ l. Then we have the following theorem by based on its distance to an anchor point. rewriting Lemma 2 in a conservative way. Theorem 3. Given an uncertain query Q and an anchor point Q S2 a, for any rectangular region r and distance γ, we have: ε 1) If γ > δmax (a, r), Pf all (r, γ) ≥ LPf all (a, γ − a p γ γ δmax (a, r)). t.c om om a p 2) Pf all (r, γ) ≤ U Pf all (a, δmax (a, r) + γ) po t.c −LPf all (a, δmin (a, r) −γ − ) where is an arbitrarily gs po Q γ − δ (a , p ) S1 lo ssmall positive value. .b og ts .bl Let LPf all (r, γ) and U Pf all (r, γ) represent the lower ec ts (a) Lower Bound (b) Upper Bound oj c and upper bounds of the falling probability derived from pr oje Fig. 10. Lower and Upper Bound Theorem 3. We can safely prune (validate) an entry e if re r lo rep Lemma 2. Let a denote an anchor point regarding the U Pf all (embb , γ) < θ (LPf all (embb , γ) ≥ θ). Recall that embb xp lo represents the minimal bounding box of e. It takes O(d) ee xp uncertain query Q. For any point p ∈ S and a distance γ, we time to compute δmax (a, embb ) and δmin (a, embb ). Mean- .ie ee have w e while, the computation of LPf all (a, l) and U Pf all (a, l) for w .i 1) If γ > δ(a, p), Pf all (p, γ) ≥ Pf all (a, γ − δ(a, p)). w w any l > 0 costs O(log nd ) time because pre-computed :// w 2) Pf all (p, γ) ≤ Pf all (a, δ(a, p) + γ) − Pf all (a, δ(a, p) − tp //w distance values in Da are sorted. Therefore, the filtering γ − ) where is an arbitrarily small positive value. 1 ht ttp: time of each entry is O(d + log nd ) for each anchor point. h Proof: Suppose γ > δ(a, p), then according to the tri- angle inequality for any x ∈ Q with δ(x, a) ≤ γ − δ(a, p), 3.4.2 Heuristic with a finite number of anchor points we have δ(x, p) ≤ δ(a, p)+δ(x, a) ≤ δ(a, p)+(γ−δ(a, p)) = Let AP denote a set of anchor points for the uncertain γ. This implies that Pf all (p, γ) ≥ Pf all (a, γ − δ(a, p)) query Q. We do not need to further process an entry e in according to Equation 2. Fig. 10(a) illustrates an example Algorithm 1 if it is filtered by any anchor point a ∈ AP. of the proof in 2 dimensional space. In Fig. 10(a), we have Intuitively, the more anchor points employed by Q, the Ca,γ−δ(a,p) ⊆ Cp,γ if γ > δ(a, p). Let S denote the striped more powerful the filter will be. However, we cannot area which is the intersection of Ca,γ−δ(a,p) and Q. employ a large number of anchor points due to the space Clearly, we have Pf all (a, γ − δ(a, p)) = x∈S Q.pdf (x)dx and filtering time limitations. Therefore, it is important and δ(x, p) ≤ γ for any x ∈ S. Consequently, Pf all (p, γ) to investigate how to choose a limited number of anchor ≥ Pf all (a, γ − δ(a, p)) holds. points such that the filter can work effectively. With similar rationale, for any x ∈ Q we have δ(x, a) ≤ δ(a, p) + γ if δ(x, p) ≤ γ. This implies Anchor points construction. We first investigate how that Pf all (p, γ) ≤ Pf all (a, δ(a, p) + γ). Moreover, for to evaluate the “goodness” of an anchor point regard- any x ∈ Q with δ(x, a) ≤ δ(a, p) − γ − , we have ing the computation of LPf all (p, γ). Suppose all anchor δ(x, a) > γ. Recall that represents an arbitrarily small points have the same falling probability functions; that constant value. This implies that x does not contribute is Pf all (ai , l) = Pf all (aj , l) for any two anchor points to Pf all (p, γ) if δ(x, a) ≤ δ(a, p) − γ − . Consequently, ai and aj . Then the closest anchor point regarding p Pf all (p, γ) ≤ Pf all (a, δ(a, p) + γ) − Pf all (a, δ(a, p) − γ − ) will provide the largest LPf all (p, γ). Since there is no a holds. As shown in Fig. 10(b), we have Pf all (p, γ) ≤ priori knowledge about the distribution of the points, we Pf all (a, δ(a, p) + γ) because Cp,γ ⊆ Ca,δ(a,p)+γ . Since assume they follow the uniform distribution. Therefore, anchor points should be uniformly distributed. If falling 1. We have Pf all (a, δ(a, p) − γ − ) = 0 if δ(a, p) ≤ γ probabilistic functions of the anchor points are different,