This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
of the police patrol route.

Motivated by the above applications, in the paper we study the problem of aggregate computation against the data points which have at least probability θ to be within distance γ regarding an uncertain location based query.

Challenges. A naive way to solve this problem is, for each data point p ∈ S, to calculate the probability, namely the falling probability, of Q being within distance γ of p, select p against a given probability threshold, and then conduct the aggregate. This involves the calculation of an integral regarding each p and Q.pdf for each p ∈ S; unless Q.pdf has a very simple distribution (e.g., a uniform distribution), such a calculation may often be very expensive, and the naive method may be computationally prohibitive when a large number of data points is involved. In the paper we target the problem of efficiently computing range aggregates against an uncertain Q for arbitrary Q.pdf and Q.region. Note that when Q.pdf is a uniform distribution within a circular region Q.region, a circular "window" can be immediately obtained according to γ and Q.region, so that the computation of range aggregates can be conducted via window aggregates [27] over S.

Contributions. Our techniques are developed based on the standard filtering-and-verification paradigm. We first discuss how to apply the existing probabilistically constrained regions (PCR) technique [26] to our problem. Then, we propose two novel distance based filtering techniques, statistical filtering (STF) and anchor point filtering (APF), respectively, to address the inherent limits of the PCR technique. The basic idea of the STF technique is to bound the falling probability of the points by applying some well known statistical inequalities, where only a small amount of statistical information about the uncertain location based query Q is required. The STF technique is simple and space efficient (only d + 2 float numbers are required, where d denotes the dimensionality), and experiments show that it is effective. For the scenarios where a considerably "large" space is available, we propose a view based filter which consists of a set of anchor points. An anchor point may reside at any location, and its falling probability regarding Q is pre-computed for several γ values. Then many data points might be effectively filtered based on their distances to the anchor points. For a given space budget, we investigate how to construct the anchor points and their accessing orders.

To the best of our knowledge, we are the first to identify the problem of computing range aggregates against an uncertain location based query. In this paper, we investigate the problem regarding both continuous and discrete Q. Our principal contributions can be summarized as follows.

• We propose two novel filtering techniques, STF and APF, respectively. The STF technique has a decent filtering power and only requires the storage of very limited pre-computed information. APF provides the flexibility to significantly enhance the filtering power by demanding more pre-computed information to be stored. Both of them can be applied to the continuous case and the discrete case.
• Extensive experiments are conducted to demonstrate the efficiency of our techniques.
• While we focus on the problem of range counting for uncertain location based queries in the paper, our techniques can be immediately extended to other range aggregates.

The remainder of the paper is organized as follows. Section 2 formally defines the problem and presents preliminaries. In Section 3, following the filtering-and-verification framework, we propose three filtering techniques. Section 4 evaluates the proposed techniques with extensive experiments. Then some possible extensions of our techniques are discussed in Section 5. This is followed by related work in Section 6. Section 7 concludes the paper.

2 BACKGROUND INFORMATION

We first formally define the problem in Section 2.1; then Section 2.2 presents the PCR technique [26], which is employed in the filtering technique proposed in Section 3.3.

Notation            Definition
Q                   uncertain location based query
S                   a set of points
q                   instance of an uncertain query Q
d                   dimensionality
Pq                  the probability of the instance q to appear
θ and γ             probabilistic threshold and query distance
Pfall(Q, p, γ)      the falling probability of p regarding Q and γ
Qθ,γ(S)             {p | p ∈ S ∧ Pfall(Q, p, γ) ≥ θ}
p, x, y (S)         point (a set of data points)
e                   R-tree entry
Cp,r                a circle (sphere) centred at p with radius r
δ(x, y)             the distance between x and y
δmax(min)(r1, r2)   the maximal (minimal) distance between two rectangular regions
gQ                  mean of Q
ηQ                  weighted average distance of Q
σQ                  variance of Q
ε                   arbitrarily small positive constant value
a                   anchor point
nap                 the number of anchor points
LPfall(p, γ)        lower bound of Pfall(p, γ)
UPfall(p, γ)        upper bound of Pfall(p, γ)
nd                  the number of different distances pre-computed for each anchor point
Da                  a set of distance values used by anchor point a

TABLE 1. The summary of notations.

2.1 Problem Definition

In the paper, S is a set of points in a d-dimensional numerical space. The distance between two points x and y is denoted by δ(x, y). Note that the techniques developed in the paper can be applied to any distance metric [5]. In the examples and experiments, the Euclidean distance is used. For two rectangular regions r1 and r2, we have δmax(r1, r2) = max{∀x∈r1, y∈r2} δ(x, y) and

  δmin(r1, r2) = 0 if r1 ∩ r2 ≠ ∅; otherwise δmin(r1, r2) = min{∀x∈r1, y∈r2} δ(x, y).   (1)
An uncertain (location based) query Q may be described by a continuous or a discrete distribution as follows.

Definition 1 (Continuous Distribution). An uncertain query Q is described by a probability density function Q.pdf. Let Q.region represent the region where Q might appear; then ∫{x∈Q.region} Q.pdf(x)dx = 1.

Definition 2 (Discrete Distribution). An uncertain query Q consists of a set of instances (points) {q1, q2, ..., qn} in a d-dimensional numerical space, where qi appears with probability Pqi and Σ{q∈Q} Pq = 1.

Note that, in Section 5, we also cover the applications where Q can have a non-zero probability to be absent; that is, ∫{x∈Q.region} Q.pdf(x)dx = c or Σ{q∈Q} Pq = c for a c < 1.

For a point p, we use Pfall(Q, p, γ) to represent the probability of Q being within distance γ of p, called the falling probability of p regarding Q and γ. It is formally defined below.

For continuous cases,

  Pfall(Q, p, γ) = ∫{x∈Q.region ∧ δ(x,p)≤γ} Q.pdf(x)dx   (2)

For discrete cases,

  Pfall(Q, p, γ) = Σ{q∈Q ∧ δ(q,p)≤γ} Pq   (3)

In the paper hereafter, Pfall(Q, p, γ) is abbreviated to Pfall(p, γ), and Q.region and Q.pdf are abbreviated to Q and pdf respectively, whenever there is no ambiguity. It is immediate that Pfall(p, γ) is a monotonically increasing function with respect to the distance γ.

Fig. 2. Example of Pfall(Q, p, γ).

Problem Statement. In many applications, users are only interested in the points with falling probabilities exceeding a given probabilistic threshold regarding Q and γ. In this paper we investigate the problem of the probabilistic threshold based uncertain location range aggregate query on point data; it is formally defined below.

Definition 3 (Uncertain Range Aggregate Query). Given a set S of points, an uncertain query Q, a query distance γ and a probabilistic threshold θ, we want to compute an aggregate function (e.g., count, avg, and sum) against the points p ∈ Qθ,γ(S), where Qθ,γ(S) denotes the subset of points {p} ⊆ S such that Pfall(p, γ) ≥ θ.

In this paper, our techniques will be presented based on the aggregate count. Nevertheless, they can be immediately extended to cover other aggregates, such as min, max, sum, avg, etc., over some non-locational attributes (e.g., the weight value of the object in the missile example).

Example 1. In Fig. 2, S = {p1, p2, p3} and Q = {q1, q2, q3}, where Pq1 = 0.4, Pq2 = 0.3 and Pq3 = 0.3. According to Definition 3, for the given γ, we have Pfall(p1, γ) = 0.4, Pfall(p2, γ) = 1, and Pfall(p3, γ) = 0.6. Therefore, Qθ,γ(S) = {p2, p3} if θ is set to 0.5, and hence |Qθ,γ(S)| = 2.

2.2 Probabilistically Constrained Regions (PCR)

In [26], Tao et al. study the problem of range queries on uncertain objects, in which the query is a rectangular window and the location of each object is uncertain. Although the problem studied in [26] is different from the one in this paper, in Section 3.3 we show how to modify the techniques developed in [26] to support uncertain location based queries.

In the following part, we briefly introduce the Probabilistically Constrained Region (PCR) technique developed in [26]. In the same way as the uncertain location based query, an uncertain object U is modeled by a probability density function U.pdf(x) and an uncertain region U.region. The probability that the uncertain object U falls in the rectangular window query rq, denoted by Pfall(U, rq), is defined as ∫{x∈U.region ∩ rq} U.pdf(x)dx. In [26], the probabilistically constrained region of the uncertain object U regarding probability θ (0 ≤ θ ≤ 0.5), denoted by U.pcr(θ), is employed in the filtering technique. Particularly, U.pcr(θ) is a rectangular region constructed as follows.

For each dimension i, the projection of U.pcr(θ) is denoted by [U.pcri−(θ), U.pcri+(θ)], where ∫{x∈U.region ∧ xi ≤ U.pcri−(θ)} U.pdf(x)dx = θ and ∫{x∈U.region ∧ xi ≥ U.pcri+(θ)} U.pdf(x)dx = θ. Note that xi represents the coordinate value of the point x on the i-th dimension. Then U.pcr(θ) corresponds to a rectangular region [U.pcr−(θ), U.pcr+(θ)], where U.pcr−(θ) (U.pcr+(θ)) is the lower (upper) corner, and the coordinate value of U.pcr−(θ) (U.pcr+(θ)) on the i-th dimension is U.pcri−(θ) (U.pcri+(θ)). Fig. 3(a) illustrates the U.pcr(0.2) of an uncertain object U in 2-dimensional space. Therefore, the probability mass of U on the left (right) side of l1− (l1+) is 0.2, and the probability mass of U below (above) l2− (l2+) is 0.2 as well. Following is a motivating example of how to derive the lower and upper bounds of the falling probability based on the PCR.

Example 2. According to the definition of the PCR, in Fig. 3(b) the probabilistic mass of U in the shaded area is 0.2, i.e., ∫{x∈U.region ∧ x1 ≥ U.pcr1+(θ)} U.pdf(x)dx = 0.2. Then, it is immediate that Pfall(U, rq1) < 0.2 because rq1 does not intersect U.pcr(0.2). Similarly, we have Pfall(U, rq2) ≥ 0.2 because the shaded area is enclosed by rq2.

The following theorem [26] formally introduces how to prune or validate an uncertain object U based on U.pcr(θ) or U.pcr(1 − θ). Note that we say an uncertain object is pruned (validated) if we can claim Pfall(U, rq) < θ (Pfall(U, rq) ≥ θ) based on the PCR.

Theorem 1. Given an uncertain object U, a range query rq (rq is a rectangular window) and a probabilistic threshold θ:
Fig. 3. A 2d probabilistically constrained region (PCR(0.2)): (a) PCR; (b) PCR based filtering.

1) For θ > 0.5, U can be pruned if rq does not fully contain U.pcr(1 − θ);
2) For θ ≤ 0.5, the pruning condition is that rq does not intersect U.pcr(θ);
3) For θ > 0.5, the validating criterion is that rq completely contains the part of Umbb on the right (left) of the plane U.pcri−(1 − θ) (U.pcri+(1 − θ)) for some i ∈ [1, d], where Umbb is the minimal bounding box of the uncertain region U.region;
4) For θ ≤ 0.5, the validating criterion is that rq completely contains the part of Umbb on the left (right) of the plane U.pcri−(θ) (U.pcri+(θ)) for some i ∈ [1, d].

3 FILTERING-AND-VERIFICATION ALGORITHM

According to the definition of the falling probability (i.e., Pfall(p, γ)) in Equation 2, the computation involves integral calculation, which may be expensive in terms of CPU cost. Based on Definition 3, we only need to know whether or not the falling probability of a particular point regarding Q and γ exceeds the probabilistic threshold for the uncertain aggregate range query. This motivates us to follow the filtering-and-verification paradigm for the uncertain aggregate query computation. Particularly, in the filtering phase, effective and efficient filtering techniques will be applied to prune or validate the points. We say a point p is pruned (validated) regarding the uncertain query Q, distance γ and probabilistic threshold θ if we can claim that Pfall(p, γ) < θ (Pfall(p, γ) ≥ θ) based on the filtering techniques, without explicitly computing Pfall(p, γ). The points that can be neither pruned nor validated will be verified in the verification phase, in which their falling probabilities are calculated. Therefore, it is desirable to develop effective and efficient filtering techniques to prune or validate points such that the number of points being verified can be significantly reduced.

In this section, we first present a general framework for the filtering-and-verification algorithm based on filtering techniques in Section 3.1. Then a set of filtering techniques is proposed. Particularly, Section 3.2 proposes the statistical filtering technique. Then we investigate how to apply the PCR based filtering technique in Section 3.3. Section 3.4 presents the anchor point based filtering technique. For presentation simplicity, we consider the continuous case of the uncertain query in this section. Section 3.5 shows that the techniques proposed in this section can be immediately applied to the discrete case.

3.1 A Framework for the Filtering-and-Verification Algorithm

In this subsection, following the filtering-and-verification paradigm, we present a general framework to support the uncertain range aggregate query based on the filtering technique. To facilitate the aggregate query computation, we assume the set S of points is organized by an aggregate R-tree [22], denoted by RS. Note that an entry e of RS might be a data entry or an intermediate entry, where a data entry corresponds to a point in S and an intermediate entry groups a set of data entries or child intermediate entries. Assume a filter, denoted by F, is available to prune or validate a data entry (i.e., a point) or an intermediate entry (i.e., a set of points).

Algorithm 1 illustrates the framework of the filtering-and-verification algorithm. Note that the details of the filtering techniques will be introduced in the following subsections. The algorithm consists of two phases. In the filtering phase (Lines 3-16), for each entry e of RS to be processed, we do not need to further process e if it is pruned or validated by the filter F. We say an entry e is pruned (validated) if the filter can claim Pfall(p, γ) < θ (Pfall(p, γ) ≥ θ) for any point p within embb. The counter cn is increased by |e| (Line 6) if e is validated, where |e| denotes the aggregate value of e (i.e., the number of data points in e). Otherwise, the point p associated with e is a candidate point if e corresponds to a data entry (Line 10), and all child entries of e are put into the queue for further processing if e is an intermediate entry (Line 12). The filtering phase terminates when the queue is empty. In the verification phase (Lines 17-21), candidate points are verified by the integral calculations according to Equation 2.

Cost Analysis. The total time cost of Algorithm 1 is as follows.

  Cost = Nf × Cf + Nio × Cio + Nca × Cvf   (4)

Particularly, Nf represents the number of entries being tested by the filter on Line 5, and Cf is the time cost for each test. Nio denotes the number of nodes (pages) accessed (Line 12), and Cio corresponds to the delay of each node (page) access of RS. Nca represents the size of the candidate set C, and Cvf is the computation cost for each verification (Line 18), in which numerical integral computation is required. With a reasonable filtering time cost (i.e., Cf), the dominant cost of Algorithm 1 is determined by Nio and Nca because Cio and Cvf might be expensive. Therefore, in the paper we aim to develop effective and efficient filtering techniques to reduce Nca and Nio.

Filtering. Suppose there is no filter F in Algorithm 1; then all points in S will be verified. Regarding the example in Fig. 4, all 5 points p1, p2, p3, p4 and p5 will be verified. A straightforward filtering technique is based on the minimal and maximal distances between the minimal
Algorithm 1 Filtering-and-Verification(RS, Q, F, γ, θ)
Input: RS: an aggregate R-tree on the data set S,
  Q: uncertain query, F: filter, γ: query distance,
  θ: probabilistic threshold.
Output: |Qθ,γ(S)|
Description:
1: Queue := ∅; cn := 0; C := ∅;
2: Insert the root of RS into Queue;
3: while Queue ≠ ∅ do
4:   e ← dequeue from the Queue;
5:   if e is validated by the filter F then
6:     cn := cn + |e|;
7:   else
8:     if e is not pruned by the filter F then
9:       if e is a data entry then
10:        C := C ∪ p, where p is the data point represented by e;
11:      else
12:        put all child entries of e into Queue;
13:      end if
14:    end if
15:  end if
16: end while
17: for each point p ∈ C do
18:   if Pfall(Q, p, γ) ≥ θ then
19:     cn := cn + 1;
20:   end if
21: end for
22: Return cn

bounding boxes (MBBs) of an entry and the uncertain query. Clearly, for any θ we can safely prune an entry if δmin(Qmbb, embb) > γ, or validate it if δmax(Qmbb, embb) ≤ γ. We refer to this as the maximal/minimal distance based filtering technique, namely MMD. The MMD technique is time efficient, as it takes only O(d) time to compute the minimal and maximal distances between Qmbb and embb. Recall that Qmbb is the minimal bounding box of Q.region.

Fig. 4. Running Example (Q: uncertain location based range query).

Example 3. As shown in Fig. 4, suppose the MMD filtering technique is applied in Algorithm 1; then p1 is pruned, and the other 4 points p2, p3, p4 and p5 will be verified.

Although the MMD technique is very time efficient, its filtering capacity is limited because it does not make use of the distribution information of the uncertain query Q and the probabilistic threshold θ. This motivates us to develop more effective filtering techniques based on some pre-computations on the uncertain query Q such that the number of entries (i.e., points) being pruned or validated in Algorithm 1 is significantly increased. In the following subsections, we present three filtering techniques, named STF, PCR and APF respectively, which can significantly enhance the filtering capability of the filter.

3.2 Statistical Filter

In this subsection, we propose a statistical filtering technique, namely STF. After introducing the motivation of the technique, we present some important statistical information of the uncertain query and then show how to derive the lower and upper bounds of the falling probability of a point regarding an uncertain query Q, distance γ and probabilistic threshold θ.

Motivation. As shown in Fig. 5, given an uncertain query Q1 and γ, we cannot prune the point p based on the MMD technique, regardless of the value of θ, although intuitively the falling probability of p regarding Q1 is likely to be small. Similarly, we cannot validate p for Q2. This motivates us to develop a new filtering technique which is as simple as MMD, but can exploit θ to enhance the filtering capability. In the following part, we show that lower and upper bounds of Pfall(p, γ) can be derived based on some statistics of the uncertain query. Then a point may be immediately pruned (validated) based on the upper (lower) bound of Pfall(p, γ), denoted by UPfall(p, γ) (LPfall(p, γ)).

Example 4. In Fig. 5, suppose θ = 0.5 and we have UPfall(Q1, p, γ) = 0.4 (LPfall(Q2, p, γ) = 0.6) based on the statistical bounds; then p can be safely pruned (validated) without explicitly computing its falling probability regarding Q1 (Q2). Regarding the running example in Fig. 4, suppose θ = 0.2 and we have UPfall(p2, γ) = 0.15 while UPfall(pi, γ) ≥ 0.2 for 3 ≤ i ≤ 5; then p2 is pruned. Therefore, three points (p3, p4 and p5) are verified in Algorithm 1 when the MMD and statistical filtering techniques are applied.

Fig. 5. Motivation Example.

Statistics of the uncertain query. To apply the statistical filtering technique, the following statistics of the uncertain query Q are pre-computed.

Definition 4 (mean (gQ)). gQ = ∫{x∈Q} x × Q.pdf(x)dx.

Definition 5 (weighted average distance (ηQ)). ηQ = ∫{x∈Q} δ(x, gQ) × Q.pdf(x)dx.

Definition 6 (variance (σQ)). σQ = ∫{x∈Q} δ(x, gQ)² × Q.pdf(x)dx.

Derive lower and upper bounds of Pfall(p, γ). For a point p ∈ S, the following theorem shows how to derive the lower and upper bounds of Pfall(p, γ) based on the above statistics of Q. Then, without explicitly computing Pfall(p, γ), we may prune or validate the point p based on UPfall(p, γ) and LPfall(p, γ) derived from the statistics of Q.
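For a discrete uncertain query (Definition 2), the integrals in Definitions 4-6 reduce to probability-weighted sums. The following sketch (our illustration; the instance coordinates and weights are invented) computes the d + 2 floats that the STF filter stores.

```python
import math

def query_statistics(instances, probs):
    """Compute (gQ, etaQ, sigmaQ) of Definitions 4-6 for a discrete query.

    instances: list of d-dimensional points; probs: their probabilities (sum to 1).
    Returns d + 2 floats in total: the mean vector plus two scalars."""
    d = len(instances[0])
    # mean gQ: probability-weighted average of the instances (Definition 4)
    g = [sum(p * x[i] for x, p in zip(instances, probs)) for i in range(d)]
    dist = lambda x: math.sqrt(sum((xi - gi) ** 2 for xi, gi in zip(x, g)))
    # weighted average distance etaQ (Definition 5)
    eta = sum(p * dist(x) for x, p in zip(instances, probs))
    # "variance" sigmaQ = E[delta(x, gQ)^2] as defined in Definition 6
    sigma = sum(p * dist(x) ** 2 for x, p in zip(instances, probs))
    return g, eta, sigma
```

For instance, for the 1-dimensional instances 0, 1, 2 with equal weights 1/3, the sketch yields gQ = (1), ηQ = 2/3 and σQ = 2/3.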
Theorem 2. Given an uncertain query Q and a distance γ, suppose the mean gQ, weighted average distance ηQ and variance σQ of Q are available. Then for a point p, we have:

1) If γ > μ1, then Pfall(p, γ) ≥ 1 − 1 / (1 + (γ − μ1)²/σ1²), where μ1 = δ(gQ, p) + ηQ and σ1² = σQ − ηQ² + 4ηQ × δ(gQ, p).

2) If γ < δ(gQ, p) − ηQ − ε, then Pfall(p, γ) ≤ 1 / (1 + (γ' − μ2)²/σ2²), where μ2 = Δ + ηQ, σ2² = σQ − ηQ² + 4ηQ × Δ, Δ = γ + γ' + ε − δ(p, gQ), and γ' > 0. The ε represents an arbitrarily small positive constant value.

Before the proof of Theorem 2, we first introduce Cantelli's Inequality [19], described by Lemma 1, which is the one-sided version of the Chebyshev Inequality.

Lemma 1. Let X be a univariate random variable with expected value μ and finite variance σ². Then for any C > 0, Pr(X − μ ≥ C × σ) ≤ 1 / (1 + C²).

Following is the proof of Theorem 2.

Proof: Intuition of the proof. For a given point p ∈ S, its distance to Q can be regarded as a univariate random variable Y, and we have Pfall(p, γ) = Pr(Y ≤ γ). Given γ, we can derive the lower and upper bounds of Pr(Y ≤ γ) (Pfall(p, γ)) based on the statistical inequality in Lemma 1 if the expectation E(Y) and variance Var(Y) of the random variable Y are available. Although E(Y) and Var(Y) take different values regarding different points, we show that the upper bounds of E(Y) and Var(Y) can be derived based on the mean (gQ), weighted average distance (ηQ) and variance (σQ) of the query Q. Then, the correctness of the theorem follows.

Details of the proof. The uncertain query Q is a random variable which equals x ∈ Q.region with probability Q.pdf(x). For a given point p, let Y denote the distance distribution between p and Q; that is, Y is a univariate random variable and Y.pdf(l) = ∫{x∈Q.region ∧ δ(x,p)=l} Q.pdf(x)dx for any l ≥ 0. Consequently, we have Pfall(p, γ) = Pr(Y ≤ γ) according to Equation 2. Let μ = E(Y), σ² = Var(Y) and C = (γ − μ)/σ; then based on Lemma 1, if γ > μ we have

  Pr(Y ≥ γ) = Pr(Y − μ ≥ C × σ) ≤ 1 / (1 + (γ − μ)²/σ²)

Then it is immediate that

  Pr(Y ≤ γ) ≥ 1 − Pr(Y ≥ γ) ≥ 1 − 1 / (1 + (γ − μ)²/σ²)   (5)

According to Inequality 5 we can derive the lower bound of Pfall(p, γ). Next, we show how to derive the upper bound of Pfall(p, γ). As illustrated in Fig. 6, let p' denote a dummy point on the line p gQ with δ(p', p) = γ + γ' + ε, where ε is an arbitrarily small positive constant value. Similar to the definition of Y, let Y' be the distance distribution between p' and Q; that is, Y' is a univariate random variable where Y'.pdf(l) = ∫{x∈Q.region ∧ δ(x,p')=l} Q.pdf(x)dx for any l ≥ 0. Then, as shown in Fig. 6, for any point x ∈ Q with δ(x, p') ≤ γ' (shaded area), we have δ(x, p) > γ. This implies that Pr(Y ≤ γ) ≤ Pr(Y' ≥ γ'). Let μ' = E(Y') and σ'² = Var(Y'); according to Lemma 1, when γ' > μ' we have

  Pr(Y ≤ γ) ≤ Pr(Y' ≥ γ') ≤ 1 / (1 + (γ' − μ')²/σ'²)   (6)

Because the values of μ, σ², μ' and σ'² may change regarding different points p ∈ S, we cannot pre-compute them. Nevertheless, in the following part we show that their upper bounds can be derived based on the statistical information of Q, which can be pre-computed based on the probabilistic distribution of Q.

Fig. 6. Proof of Upper bound.

Based on the triangle inequality, for any x ∈ Q we have δ(x, p) ≤ δ(x, gQ) + δ(p, gQ) and δ(x, p) ≥ |δ(x, gQ) − δ(p, gQ)|. Then we have

  μ = ∫{y∈Y} y × Y.pdf(y)dy = ∫{x∈Q} δ(x, p) × pdf(x)dx
    ≤ ∫{x∈Q} (δ(p, gQ) + δ(x, gQ)) × pdf(x)dx
    = δ(gQ, p) + ηQ = μ1

and

  σ² = E(Y²) − E²(Y)
     ≤ ∫{x∈Q} (δ(gQ, p) + δ(x, gQ))² pdf(x)dx − (δ(gQ, p) − ηQ)²
     = 2 × δ(gQ, p) × ∫{x∈Q} δ(x, gQ) × pdf(x)dx + ∫{x∈Q} δ(x, gQ)² × pdf(x)dx + 2 × δ(gQ, p) × ηQ − ηQ²
     = σQ − ηQ² + 4ηQ × δ(gQ, p) = σ1²

Together with Inequality 5, we have Pr(Y ≤ γ) ≥ 1 − 1/(1 + (γ − μ)²/σ²) ≥ 1 − 1/(1 + (γ − μ1)²/σ1²) if μ1 < γ. With a similar rationale, let Δ = δ(gQ, p') = γ + γ' + ε − δ(p, gQ); we have μ' ≤ Δ + ηQ = μ2 and σ'² ≤ σQ − ηQ² + 4ηQ × Δ = σ2². Based on Inequality 6, we have Pr(Y ≤ γ) ≤ 1/(1 + (γ' − μ')²/σ'²) ≤ 1/(1 + (γ' − μ2)²/σ2²) if γ < δ(gQ, p) − ηQ − ε. Therefore, the correctness of the theorem follows.

The following extension is immediate based on a similar rationale to Theorem 2.

Extension 1. Suppose r is a rectangular region; we can use δmin(r, gQ) and δmax(r, gQ) to replace δ(p, gQ) in Theorem 2 for the lower and upper probabilistic bound computations respectively.
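To make case 1 of Theorem 2 concrete, the sketch below (illustrative only; the discrete query and the point p are invented for the check) evaluates the Cantelli-based lower bound and compares it with the exact falling probability computed via Equation 3.

```python
import math

def stf_lower_bound(g, eta, sigma, p, gamma):
    """Case 1 of Theorem 2: lower bound of Pfall(p, gamma), or None if inapplicable."""
    delta_gp = math.sqrt(sum((a - b) ** 2 for a, b in zip(g, p)))
    mu1 = delta_gp + eta                       # mu1 = delta(gQ, p) + etaQ
    sigma1_sq = sigma - eta ** 2 + 4 * eta * delta_gp
    if gamma <= mu1:
        return None                            # the bound only applies when gamma > mu1
    return 1 - 1 / (1 + (gamma - mu1) ** 2 / sigma1_sq)

# A tiny discrete query on the real line: instances 0, 1, 2 with equal mass.
instances, probs = [(0.0,), (1.0,), (2.0,)], [1 / 3] * 3
g, eta, sigma = (1.0,), 2 / 3, 2 / 3           # statistics of Definitions 4-6
p, gamma = (10.0,), 15.0

lb = stf_lower_bound(g, eta, sigma, p, gamma)
exact = sum(pr for x, pr in zip(instances, probs)
            if abs(x[0] - p[0]) <= gamma)      # exact Pfall via Equation 3
assert lb is not None and lb <= exact          # the bound never exceeds Pfall
```

Here μ1 = 9 + 2/3 and σ1² = 2/3 − 4/9 + 24, giving a lower bound of about 0.54, while the exact Pfall(p, γ) is 1: the bound is loose but safe, and a point is validated as soon as the bound already reaches θ.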
Based on Extension 1, we can compute the upper Q.pcr(0.2) according to Theorem 1 in Section 2.2. Conse-
and lower bounds of Pf all (embb , γ) where embb is the quently, only p3 and p5 go to the verification phase when
minimal bounding box of the entry e, and hence prune Q.pcr(0.2) is available.
or validate e in Algorithm 1. Since gQ , ηQ and σQ are Same as [26], [28], a finite number of P CRs are pre-
pre-computed, the dominant cost in filtering phase is computed for the uncertain query Q regarding different
the distance computation between embb and gQ which probability values. For a given θ at query time, if the
is O(d). Q.pcr(θ) is not pre-computed we can choose two pre-
computed P CRs Q.pcr(θ1 ) and Q.pcr(θ2 ) where θ1 (θ2 )
is the largest (smallest) existing probability value smaller
3.3 PCR based Filter (larger) than θ. We can apply the modified PCR tech-
Motivation. Although the statistical filtering technique nique as the filter in Algorithm 1, and the filtering time
can significantly reduce the candidate size in Algo- regarding each entry tested is O(m+log(m)) in the worst
rithm 1, the filtering capacity is inherently limited be- case , where m is the number of P CRs pre-computed by
cause only a small amount of statistics are employed. the filter.
This motivates us to develop more sophisticated filtering The PCR technique can significantly enhance the filter-
techniques to further improve the filtering capacity; that ing capacity when a particular number of PCR s are pre-
is, we aim to improve the filtering capacity with more computed. The key of the PCR filtering technique is to
pre-computations (i.e., more information kept for the partition the uncertain query along each dimension. This
filter). In this subsection, the PCR technique proposed may inherently limit the filtering capacity of the PCR
in [26] will be modified for this purpose. based filtering technique. As shown in Fig. 7, we have
to use two rectangular regions for pruning and validation
R+ , p C p, purpose, and hence the Cp,γ is enlarged (shrunk) during
Q region
the computation. As illustrated in Fig. 7, all instances of
t.c om
Q in the striped area is counted for Pf all (p, γ) regarding
om
po t.c
p
R+,p , while all of them have distances larger than γ. Sim-
gs po
ilar observation goes to R−,p . This limitation is caused
lo s
.b og
Q . pcr ( 0 .4 )
by the transformation, and cannot be remedied by in-
ts .bl
creasing the number of P CRs. Our experiments also
ec ts
R
oj c
,p
confirm that the PCR technique cannot take advantage of
pr oje
Fig. 7. Transform query
re r
the large index space. This motivates us to develop new
lo rep
filtering technique to find a better trade-off between the
xp lo
PCR based Filtering technique. The PCR technique
ee xp
filtering capacity and pre-computation cost (i.e., index
proposed in [26] cannot be directly applied for filtering
.ie ee
size).
w e
in Algorithm 1 because the range query studied in [26]
w .i
w w
is a rectangular window and objects are uncertain. Nev-
:// w
tp //w
ertheless we can adapt the PCR technique as follows. 3.4 Anchor Points based Filter
ht ttp:
As shown in Fig. 7, let Cp,γ represent the circle (sphere)
h
centered at p with radius γ. Then we can regard the The anchor (pivot) point technique is widely employed
uncertain query Q and Cp,γ as an uncertain object and in various applications, which aims to reduce the query
the range query respectively. As suggested in [28], we computation cost based on some pre-computed anchor
can use R+,p (mbb of Cp,γ ) and R−,p (inner box) as (pivot) points. In this subsection, we investigate how to
shown in Fig. 7 to prune and validate the point p based apply anchor point technique to effectively and efficiently
on the P CRs of Q respectively. For instance, if θ = 0.4 reduce the candidate set size. Following is a motivating
the point p in Fig. 7 can be pruned according to case 2 example for the anchor point based filtering technique.
of Theorem 1 because R1 does not intersect Q.pcr(0.4). = 0.2
Note that similar transformation can be applied for the p1
intermediate entries as well. p5
Fig. 8. Running example

Example 5. Regarding the running example in Fig. 8, suppose Q.pcr(0.2) is pre-computed; then p1, p2 and p4 are pruned because R+,p1, R+,p2 and R+,p4 do not overlap Q.pcr(0.2).

… the large index space. This motivates us to develop a new filtering technique to find a better trade-off between the filtering capacity and the pre-computation cost (i.e., the index size).

3.4 Anchor Points based Filter

The anchor (pivot) point technique is widely employed in various applications; it aims to reduce the query computation cost based on some pre-computed anchor (pivot) points. In this subsection, we investigate how to apply the anchor point technique to effectively and efficiently reduce the candidate set size. Following is a motivating example for the anchor point based filtering technique.

Fig. 9. Running example regarding the anchor point

Motivating Example. Regarding our running example, in Fig. 9 the shaded area, denoted by Co,d, is the circle centered at o with radius d. Suppose the probabilistic mass of Q in Co,d is 0.8; then when θ = 0.2 we can safely prune p1, p2, p3 and p4 because Cpi,γ does not intersect Co,d for i = 1, 2, 3 and 4.
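The geometric test behind this example is that Cp,γ and Co,d are disjoint exactly when δ(o, p) > d + γ; in that case at most the mass of Q outside Co,d (here 1 − 0.8 = 0.2) can fall within distance γ of p. A minimal sketch under these assumptions (the names are ours; following the example's convention, a point is prunable when this residual bound does not exceed θ):

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def prunable_by_anchor_circle(p, gamma, o, d, mass_inside, theta):
    """If C_{p,gamma} misses C_{o,d} (i.e. dist(o, p) > d + gamma), only
    Q's mass outside C_{o,d} can fall within gamma of p, so
    P_fall(p, gamma) <= 1 - mass_inside; prune when that bound <= theta."""
    disjoint = dist(o, p) > d + gamma
    return disjoint and (1.0 - mass_inside) <= theta
```

With mass_inside = 0.8 and θ = 0.2, a point whose circle Cp,γ misses Co,d is pruned, exactly as for p1, …, p4 in Fig. 9.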
In the paper, an anchor point a regarding the uncertain query Q is a point in multidimensional space whose falling probabilities against different γ values are pre-computed. We can prune or validate a point based on its distance to the anchor point. For better filtering capability, a set of anchor points will be employed.

In the following part, Section 3.4.1 presents the anchor point filtering technique. In Section 3.4.2, we investigate how to construct anchor points for a given space budget, followed by a time-efficient filtering algorithm in Section 3.4.3.
3.4.1 Anchor Point filtering technique (APF)

For a given anchor point a regarding the uncertain query Q, suppose Pfall(a, l) is pre-computed for an arbitrary distance l. Lemma 2 provides lower and upper bounds of Pfall(p, γ) for any point p based on the triangle inequality. This implies we can prune or validate a point based on its distance to an anchor point.

Fig. 10. Lower and Upper Bound. (a) Lower Bound; (b) Upper Bound.

Lemma 2. Let a denote an anchor point regarding the uncertain query Q. For any point p ∈ S and a distance γ, we have:
1) If γ > δ(a, p), then Pfall(p, γ) ≥ Pfall(a, γ − δ(a, p)).
2) Pfall(p, γ) ≤ Pfall(a, δ(a, p) + γ) − Pfall(a, δ(a, p) − γ − ε), where ε is an arbitrarily small positive value.¹
Proof: Suppose γ > δ(a, p); then according to the triangle inequality, for any x ∈ Q with δ(x, a) ≤ γ − δ(a, p) we have δ(x, p) ≤ δ(a, p) + δ(x, a) ≤ δ(a, p) + (γ − δ(a, p)) = γ. This implies that Pfall(p, γ) ≥ Pfall(a, γ − δ(a, p)) according to Equation 2. Fig. 10(a) illustrates an example of the proof in 2-dimensional space. In Fig. 10(a), we have Ca,γ−δ(a,p) ⊆ Cp,γ if γ > δ(a, p). Let S denote the striped area, which is the intersection of Ca,γ−δ(a,p) and Q. Clearly, we have Pfall(a, γ − δ(a, p)) = ∫x∈S Q.pdf(x) dx and δ(x, p) ≤ γ for any x ∈ S. Consequently, Pfall(p, γ) ≥ Pfall(a, γ − δ(a, p)) holds.

With a similar rationale, for any x ∈ Q we have δ(x, a) ≤ δ(a, p) + γ if δ(x, p) ≤ γ. This implies that Pfall(p, γ) ≤ Pfall(a, δ(a, p) + γ). Moreover, for any x ∈ Q with δ(x, a) ≤ δ(a, p) − γ − ε, we have δ(x, p) > γ. Recall that ε represents an arbitrarily small constant value. This implies that x does not contribute to Pfall(p, γ) if δ(x, a) ≤ δ(a, p) − γ − ε. Consequently, Pfall(p, γ) ≤ Pfall(a, δ(a, p) + γ) − Pfall(a, δ(a, p) − γ − ε) holds. As shown in Fig. 10(b), we have Pfall(p, γ) ≤ Pfall(a, δ(a, p) + γ) because Cp,γ ⊆ Ca,δ(a,p)+γ. Since Ca,δ(a,p)−γ−ε ⊆ Ca,δ(a,p)+γ and Cp,γ ∩ Ca,δ(a,p)−γ−ε = ∅, it follows that Pfall(p, γ) ≤ Pfall(a, δ(a, p) + γ) − Pfall(a, δ(a, p) − γ − ε).
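For illustration, the bounds of Lemma 2, together with the resulting validate/prune rule (validate a point if its lower bound reaches θ, prune it if its upper bound stays below θ), can be sketched as follows; `p_fall_a` is a hypothetical stand-in for the pre-computed function Pfall(a, ·):

```python
def lemma2_bounds(p_fall_a, delta_ap, gamma, eps=1e-9):
    """Lower/upper bounds on P_fall(p, gamma) for a point p at distance
    delta_ap from anchor a, given p_fall_a(l) = P_fall(a, l)."""
    # Case 1: if gamma > delta(a, p), P_fall(p, gamma) >= P_fall(a, gamma - delta(a, p)).
    lower = p_fall_a(gamma - delta_ap) if gamma > delta_ap else 0.0
    # Case 2: the subtracted term vanishes when delta(a, p) <= gamma (footnote 1).
    subtracted = p_fall_a(delta_ap - gamma - eps) if delta_ap > gamma else 0.0
    upper = p_fall_a(delta_ap + gamma) - subtracted
    return lower, upper

def filter_point(p_fall_a, delta_ap, gamma, theta):
    """Return 'validate', 'prune', or 'unknown' via the LP/UP rule."""
    lower, upper = lemma2_bounds(p_fall_a, delta_ap, gamma)
    if lower >= theta:
        return "validate"
    if upper < theta:
        return "prune"
    return "unknown"
```

For instance, with the toy monotone function Pfall(a, l) = min(l/10, 1), an anchor at distance 3 and γ = 5 yields bounds [0.2, 0.8], so the point is validated for θ = 0.1 and pruned for θ = 0.9.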
Let LPfall(p, γ) and UPfall(p, γ) denote the lower and upper bounds derived from Lemma 2 regarding Pfall(p, γ). Then we can immediately validate a point p if LPfall(p, γ) ≥ θ, or prune p if UPfall(p, γ) < θ.

Clearly, it is infeasible to keep Pfall(a, l) for arbitrary l ≥ 0. Since Pfall(a, l) is a monotonic function with respect to l, we keep a set Da = {li} of size nd for each anchor point such that Pfall(a, li) = i/nd for 1 ≤ i ≤ nd. Then for any l > 0, we use UPfall(a, l) and LPfall(a, l) to represent the upper and lower bounds of Pfall(a, l), respectively. Particularly, UPfall(a, l) = Pfall(a, li) where li is the smallest li ∈ Da such that li ≥ l. Similarly, LPfall(a, l) = Pfall(a, lj) where lj is the largest lj ∈ Da such that lj ≤ l. Then we have the following theorem by rewriting Lemma 2 in a conservative way.

Theorem 3. Given an uncertain query Q and an anchor point a, for any rectangular region r and distance γ, we have:
1) If γ > δmax(a, r), then Pfall(r, γ) ≥ LPfall(a, γ − δmax(a, r)).
2) Pfall(r, γ) ≤ UPfall(a, δmax(a, r) + γ) − LPfall(a, δmin(a, r) − γ − ε), where ε is an arbitrarily small positive value.

Let LPfall(r, γ) and UPfall(r, γ) represent the lower and upper bounds of the falling probability derived from Theorem 3. We can safely prune (validate) an entry e if UPfall(embb, γ) < θ (LPfall(embb, γ) ≥ θ). Recall that embb represents the minimal bounding box of e. It takes O(d) time to compute δmax(a, embb) and δmin(a, embb). Meanwhile, the computation of LPfall(a, l) and UPfall(a, l) for any l > 0 costs O(log nd) time because the pre-computed distance values in Da are sorted. Therefore, the filtering time for each entry is O(d + log nd) per anchor point.
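This per-entry filtering step can be sketched as follows (hypothetical names; boxes are (lower, upper) corner tuples and `Da` is the sorted list of pre-computed distances with Pfall(a, li) = i/nd):

```python
import bisect
import math

def delta_min(a, box):
    """Minimum distance from point a to an axis-aligned box; O(d)."""
    lo, hi = box
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(a, lo, hi)))

def delta_max(a, box):
    """Maximum distance from point a to the box (farthest side per dimension); O(d)."""
    lo, hi = box
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(a, lo, hi)))

def up_fall(Da, l):
    """UPfall(a, l): probability i/nd of the smallest li in Da with li >= l.
    If no such li exists we conservatively return 1.0."""
    n_d = len(Da)
    i = bisect.bisect_left(Da, l)   # O(log nd) on the sorted Da
    return (i + 1) / n_d if i < n_d else 1.0

def lp_fall(Da, l):
    """LPfall(a, l): probability j/nd of the largest lj in Da with lj <= l
    (0 when no such lj exists, covering negative arguments as in footnote 1)."""
    n_d = len(Da)
    j = bisect.bisect_right(Da, l)  # O(log nd)
    return j / n_d if j > 0 else 0.0

def filter_entry(a, Da, e_mbb, gamma, theta, eps=1e-9):
    """Apply Theorem 3 to entry e via its mbb; O(d + log nd) per anchor."""
    dmin, dmax = delta_min(a, e_mbb), delta_max(a, e_mbb)
    lower = lp_fall(Da, gamma - dmax) if gamma > dmax else 0.0
    upper = up_fall(Da, dmax + gamma) - lp_fall(Da, dmin - gamma - eps)
    if lower >= theta:
        return "validate"
    if upper < theta:
        return "prune"
    return "unknown"
```

The binary searches over Da realize the O(log nd) bound lookups, and the two distance computations account for the O(d) term.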
3.4.2 Heuristic with a finite number of anchor points

Let AP denote a set of anchor points for the uncertain query Q. We do not need to further process an entry e in Algorithm 1 if it is filtered by any anchor point a ∈ AP. Intuitively, the more anchor points employed by Q, the more powerful the filter will be. However, we cannot employ a large number of anchor points due to the space and filtering time limitations. Therefore, it is important to investigate how to choose a limited number of anchor points such that the filter can work effectively.

Anchor points construction. We first investigate how to evaluate the "goodness" of an anchor point regarding the computation of LPfall(p, γ). Suppose all anchor points have the same falling probability functions; that is, Pfall(ai, l) = Pfall(aj, l) for any two anchor points ai and aj. Then the closest anchor point regarding p will provide the largest LPfall(p, γ). Since there is no a priori knowledge about the distribution of the points, we assume they follow the uniform distribution. Therefore, anchor points should be uniformly distributed. If falling probabilistic functions of the anchor points are different,

1. We have Pfall(a, δ(a, p) − γ − ε) = 0 if δ(a, p) ≤ γ.