Cooperative Scans:
Dynamic Bandwidth Sharing in a DBMS

Marcin Zukowski, Sándor Héman, Niels Nes, Peter Boncz
Firstname.Lastname@cwi.nl
Centrum voor Wiskunde en Informatica
Kruislaan 413, 1098 SJ Amsterdam, The Netherlands

ABSTRACT

This paper analyzes the performance of concurrent (index) scan operations in both record (NSM/PAX) and column (DSM) disk storage models and shows that existing scheduling policies do not fully exploit data-sharing opportunities and therefore result in poor disk bandwidth utilization. We propose the Cooperative Scans framework that enhances performance in such scenarios by improving data-sharing between concurrent scans. It performs dynamic scheduling of queries and their data requests, taking into account the current system situation. We first present results on top of an NSM/PAX storage layout, showing that it achieves significant performance improvements over traditional policies in terms of both the number of I/Os and overall execution time, as well as latency of individual queries. We provide benchmarks with varying system parameters, data sizes and query loads to confirm the improvement occurs in a wide range of scenarios. Then we extend our proposal to a more complicated DSM scenario, discussing numerous problems related to the two-dimensional nature of disk scheduling in column stores.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '07, September 23-28, 2007, Vienna, Austria.
Copyright 2007 VLDB Endowment, ACM 978-1-59593-649-3/07/09.

1. INTRODUCTION

In traditional database research disk scans were mostly considered trivial, and simple LRU or MRU buffering policies were proposed for them [6, 23]. We show that if scans start at different times, these policies achieve only a low amount of buffer reuse. To improve this situation, some systems support the concept of circular scans [9, 10, 20, 14], which allows queries that start later to attach themselves to already active scans. As a result, the disk bandwidth can be shared between the queries, resulting in a reduced number of I/O requests. However, this strategy is not efficient when queries process data at different speeds or a query scans only a range instead of a full table.

In this paper we analyze the performance of existing scan strategies, identifying three basic approaches: normal, attach and elevator. In normal, a traditional LRU buffering policy is employed, while in both attach and elevator incoming queries can join an ongoing scan in case there is overlap in data need, with the main difference that elevator employs a single, strictly sequential scan cursor, while attach allows for multiple (shared) cursors. Benchmarks show that they provide sharing of disk bandwidth and buffer space only in a limited set of scenarios. This is mostly caused by the fact that the disk access order is predefined when a query enters the system, hence it cannot be adjusted to optimize performance in dynamic multi-query scenarios.

[Figure 1: Normal Scans vs Cooperative Scans. On the left, Scan operators read pages through a traditional Buffer Manager; on the right, CScan operators obtain chunks from an Active Buffer Manager.]

To overcome these limitations, we introduce the Cooperative Scans framework, depicted in Figure 1. It involves CScan, a modified (index) Scan operator that announces the needed data ranges upfront to an active buffer manager (ABM). The ABM dynamically optimizes the order of disk accesses, taking into account all current CScan requests on a relation (or a set of clustered relations). This framework can run the basic normal, attach and elevator policies, but also a new policy, relevance, that is central to our proposal. Besides optimizing throughput, the relevance policy also minimizes latency. This is done by departing from the strictly sequential access pattern present in attach and elevator. Instead, relevance makes page load and eviction decisions based on per-page relevance functions, which, for example, try to evict pages with a low number of interested queries as soon as possible, while prioritizing page reads for short queries and pages that have many interested queries.

To further illustrate the need for a more flexible approach to I/O scheduling, consider the following example. Assume a system has to execute two queries, Q1 and Q2, which enter the system at the same time and process data at the same speed. Q1 needs to read 30 pages and is scheduled first, while Q2 needs 10 different pages. If those queries get serviced in a round-robin fashion, as in the normal policy, Q2 finishes after 20 pages are loaded, and Q1 after 40, giving an average query latency of 30. The elevator policy may perform better, by first fully servicing Q2 and then Q1, reducing the average waiting time from 30 to 25.
Still, elevator can choose the opposite order, resulting in waiting times of 30 and 40, hence actually increasing the average time. With relevance, we aim to get close to the optimal average query latency, without relying on the sequential scan order, by making flexible I/O scheduling decisions.
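The schedules in this example are easy to replay mechanically. The following minimal Python sketch (our illustration, not part of the paper) models one page load per time unit; the round-robin schedule corresponds to normal, and the two single-cursor orders to the lucky and unlucky elevator cases:

    def finish_times(schedule):
        # time step (1-based) at which each query receives its last page
        return {q: max(i + 1 for i, x in enumerate(schedule) if x == q)
                for q in set(schedule)}

    def average(times):
        return sum(times.values()) / len(times)

    # normal: round-robin until Q2's 10 pages are in, then Q1 drains alone
    round_robin = ["Q1", "Q2"] * 10 + ["Q1"] * 20
    print(average(finish_times(round_robin)))                # (40 + 20) / 2 = 30.0

    # elevator, lucky order (Q2 first): latencies 10 and 40, average 25
    print(average(finish_times(["Q2"] * 10 + ["Q1"] * 30)))  # 25.0

    # elevator, unlucky order (Q1 first): latencies 30 and 40, average 35
    print(average(finish_times(["Q1"] * 30 + ["Q2"] * 10)))  # 35.0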
Contributions. We view our contributions as follows: (i) the definition of the Cooperative Scans framework, and the new relevance policy; (ii) experiments that help better understand the behavior of the existing normal, attach and elevator scheduling policies, and also show that the relevance policy outperforms them in a wide range of scenarios, both for row- and column-stores; (iii) showing that I/O scheduling for column-stores is significantly more complex than for row-storage, a characteristic that so far seems to have been overlooked.

Outline. In Section 2 we motivate our research, arguing that efficiently handling (index) scans is an important topic and will become even more so in the future, given current hardware and application trends. Section 3 analyzes existing approaches to scan processing. In Section 4 we introduce the Cooperative Scans framework for row stores, and we validate its performance in Section 5. In Section 6 we extend Cooperative Scans to column stores, which have recently gained popularity. The incorporation of ABM into an existing DBMS is discussed in Section 7, where we also explore possibilities of adapting order-aware query processing operators to handle out-of-order data delivery. We discuss related work in Section 8 before concluding in Section 9.
2. MOTIVATION

Database systems are addicted to random disk I/O, caused by unclustered index lookups, and hardware trends are pushing this model to the limit of sustainability. Foreign-key joins, as well as selection predicates executed using unclustered indices, may yield large streams of row-IDs for looking up records in a target table. If this target table is large and the accesses are scattered, the needed disk pages have to be fetched using random disk I/O. To optimize performance, industry-strength RDBMSs make good use of asynchronous I/O to farm out batches of requests over multiple disk drives, both to achieve I/O parallelism between the drives and to let each disk handle multiple I/Os in a single arm movement (amortizing some access latency).

             processing                          disks              throughput
  #  CPU                         RAM     #   totsize   cost    single    5-way
  4  Xeon 3.0GHz dual-core      64GB   124    4.4TB    47%     19497    10404
  2  Opteron 2GHz               48GB   336    6.0TB    80%     12941    11531
  4  Xeon 3.0GHz dual-core      32GB    92    3.2TB    67%     11423     6768
  2  Power5 1.65GHz dual-core   32GB    45    1.6TB    65%      8415     4802

      Table 1: Official 2006 TPC-H 100GB results

Massive unclustered disk access occurs frequently in benchmarks like TPC-H, and it is not uncommon now to see benchmark configurations that use hundreds or thousands of disks. For example, Table 1 shows that the four most recent TPC-H submissions of even the smallest 100GB data size used an average of 150 disks, with a total storage capacity of 3.8 terabytes. All these disks are less than 10% full, and the main reason for their high number is to get more disk arms, allowing for a higher throughput of random I/O requests. Note from Table 1 that a high number of disks seems especially crucial in the concurrent (5-stream) query scenario.

The underlying hardware trends of the past decades, namely sustained exponential growth in CPU power as well as much faster improvement in disk bandwidth than in I/O latency, are expected to continue. Thus, to keep each next CPU generation with more cores busy, the number of disks will need to be doubled to achieve system balance.

This exponential trend is clearly unsustainable, and one can argue that in the real world (i.e. outside manufacturer benchmarking projects) it is already no longer being sustained, and database servers are often configured with fewer disks than optimal. The main reason for this is cost, both in terms of the absolute price of large I/O subsystems (nowadays taking more than two-thirds of TPC-H benchmark system cost, see Table 1) and in terms of maintenance. In a multi-thousand-disk configuration, multiple disks are expected to break each day [25], which implies the need for full-time attendance by a human system administrator.

Better Scanning. We argue that the only way to avoid random I/O is to rely more on (clustered index) scans, which depend on sequential disk bandwidth rather than latency. Modern data warehousing systems try to achieve this by:

(1) storing relations redundantly in multiple orders [2], such that more query patterns can use a clustered access path. To avoid the costs of updating multiple such tables, updates in such systems are buffered in RAM in differential lists and are dynamically merged with persistent data.

(2) exploiting correlated orderings. In MonetDB/X100, a min- and max-value is kept for each column per large disk block. Such meta-data, similar to "small materialized aggregates" [19] and also found e.g. in the Netezza system as "zonemaps" [21], allows avoiding reading unneeded blocks during an (index) scan, even if the data is not ordered on that column, but on a correlated column. For example, on the lineitem table it allows avoiding I/O for almost all non-relevant tuples in range selections on any date column in the TPC-H schema, as dates in the fact tables of a data warehouse tend to be highly correlated. This technique can sometimes result in a scan plan that requires a set of non-contiguous table ranges.

(3) using multi-table clustering or materialized views, to exploit index range-scans even over foreign-key joins.

(4) exploiting large RAMs to fully buffer small (compressed) relations, e.g. the dimensions of a star schema.

(5) reducing scan I/O volume by offering column storage (DSM) as an option [33, 28] to avoid reading unused columns. The same remark as made in (1) on handling updates applies here. However, I/O scheduling in column stores can be significantly more complex, as shown in Section 6.

(6) using lightweight compression, where the reduced I/O cost due to size reduction outweighs the CPU cost of decompression. It has been shown that with column storage, (de-)compression becomes less CPU intensive and achieves better compression ratios [33, 1].

(7) using on-disk processing to reduce disk-memory traffic and main-CPU load. For example, Netezza uses programmable FPGAs [21] to quickly perform selection on data before it gets to main memory.

After applying (1-2), data warehousing queries use (clustered index) scans for their I/O, typically selecting ranges from the fact tables. Other table I/O can be reduced using (3-4), and I/O bandwidth can be optimized to its minimum cost by (5-7). The challenge addressed here is that if this mode of operation is to be successful, the DBMS must be
capable of handling many concurrent scan queries efficiently. However, unlike asynchronous random I/O, where concurrent batches of random I/Os may allow for better amortized disk seeks thanks to request reordering, the naive approach to concurrent scans causes queries to fight for sequential read bandwidth, reducing overall throughput. We should note that we do not focus on overall query throughput alone as our efficiency metric: it is equally important that average query latency is minimized. The rationale here is that in many scenarios, for example when an application submits queries one after another, query latencies significantly influence the execution time of a query stream and hence the observed system throughput.

Our work exploits the observation that concurrent scans have a high expected amount of overlapping data need. This creates the opportunity for synchronizing queries, such that they share buffered data and thus reduce the demand for disk bandwidth. In the following we discuss how existing systems exploit this opportunity and propose a new technique that further improves bandwidth sharing.

3. TRADITIONAL SCAN PROCESSING

With multiple scans running concurrently, in a naive implementation sequential requests from different queries can interleave, causing frequent disk-arm movements and resulting in a semi-random access pattern and low overall disk throughput. To avoid this problem, most database systems execute such scans using large isolated I/O requests spanning multiple pages, together with physical clustering of table pages. As a result, the overhead of shifting the disk arm is amortized over a large chunk of data, resulting in an overall bandwidth comparable to a standalone scan.

Even when using bandwidth-efficient chunk-based I/O, different scheduling policies are used for concurrent scans. The most naive, called normal in the rest of the paper, performs scans by simply reading all disk blocks requested by a query in a sequential fashion, using an LRU policy for buffering. The disadvantage of LRU is that if one query starts too long after another, the loaded pages will already be swapped out before they can be reused. As a result, assuming there is no buffer reuse between the queries, and queries are serviced in a round-robin fashion, the expected number of I/Os performed in the system until a new query $Q_{new}$ that reads $C_{new}$ chunks finishes can be estimated as $C_{new} + \sum_{q \in queries} MIN(C_{new}, C_q)$.
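This estimate is straightforward to evaluate; the small helper below (our illustration, with hypothetical parameter names) computes it for a fresh query competing with a set of running scans:

    def expected_ios_normal(c_new, running_chunks):
        # no buffer reuse, round-robin servicing: every running query q
        # contributes min(c_new, c_q) loads before the new query finishes
        return c_new + sum(min(c_new, c_q) for c_q in running_chunks)

    # a 30-chunk query entering next to queries with 10 and 50 chunks left
    print(expected_ios_normal(30, [10, 50]))  # 30 + 10 + 30 = 70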
The major drawback of the normal policy is that it does not try to reuse data shared by different running queries. In a dynamic environment, with multiple partial scans running at the same time, it is likely that the buffer pool contains some data that is useful for a given query. With a table consisting of $C_T$ chunks, a query that needs $C_Q$ chunks and a buffer pool of $C_B$ chunks, the probability of finding some useful data in the randomly-filled buffer is:

    $P_{reuse} = 1 - \prod_{i=0}^{C_B - 1} \frac{C_T - C_Q - i}{C_T - i}$    (1)

[Figure 2: Probability of finding a useful chunk in a randomly-filled buffer pool, with varying buffer pool size (1%, 5%, 10%, 20% and 50% of the table buffered) and query demand (chunks needed by a query, out of 100).]

As Figure 2 shows, even for small scanned ranges and buffer sizes, this probability can be high, e.g. over 50% for a 10% scan with a buffer pool holding 10% of the relation. Unfortunately, the normal policy, by enforcing a sequential order of data delivery, at a given time can use only a single page, reducing this probability to $C_B / C_T$.
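Equation (1) can be transcribed directly; the sketch below (ours, not the paper's code) reproduces the quoted data point:

    from math import prod

    def p_reuse(c_t, c_q, c_b):
        # probability that a randomly-filled buffer of c_b chunks holds at
        # least one of the c_q chunks a query needs, out of c_t in the table
        return 1.0 - prod((c_t - c_q - i) / (c_t - i) for i in range(c_b))

    # 10% scan, buffer holding 10% of a 100-chunk relation:
    print(p_reuse(100, 10, 10))  # about 0.67, i.e. over 50%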
In many cases it is possible to relax the requirement of sequential data delivery imposed by normal. Even when using a clustered index for attribute selection, consuming operators often do not need data in a particular order. This allows for scheduling policies with "out-of-order" data delivery.

A simple idea of sharing disk access between overlapping queries is used in the attach strategy. When a query Qnew enters the system, it looks at all other running scans, and if one of them (Qold) overlaps, it starts to read data at Qold's current position. To optimize performance, attach should choose the query that has the largest remaining overlap with Qnew. Once Qnew reaches the end of its desired range, it starts from the beginning until reaching its original position. This policy, also known as "circular scans" or "shared scans", is used among others in Microsoft SQL Server [10], RedBrick [9] and Teradata [20], and allows significant performance improvements in many scenarios. The attach policy, however, may suffer from three problems. First, if one query moves much faster than the other, the gap between them may become so large that pages read by the fastest query are swapped out before the slower one reaches them (they "detach"). Second, if queries are range scans, it is possible that one of the queries that process data together finishes, and the other continues by itself, even though it could attach to another running query. For example, if a full scan is underway but not yet in its range, attach misses this sharing opportunity. Finally, when exploiting per-block meta-data, a scan request can consist of multiple ranges, making it even harder to benefit from sharing a scan with a single query. As a result, the upper bound on the number of I/Os performed by attach is the same as in normal.
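To make the attach decision concrete, here is a small sketch (our own, with hypothetical helper names) that picks the running scan with the largest remaining overlap; ranges are half-open [start, end) chunk intervals, and the real policy additionally wraps around to pick up the part of the range it skipped:

    def remaining_overlap(cursor, end, new_range):
        # chunks the running scan will still read inside new_range
        lo, hi = new_range
        return max(0, min(end, hi) - max(cursor, lo))

    def pick_attach_target(running, new_range):
        # running: list of (name, cursor_position, end_of_range)
        best = max(running,
                   key=lambda s: remaining_overlap(s[1], s[2], new_range),
                   default=None)
        if best and remaining_overlap(best[1], best[2], new_range) > 0:
            return best[0]
        return None  # no overlap: start a fresh sequential scan

    running = [("Qa", 40, 100), ("Qb", 10, 30)]
    print(pick_attach_target(running, (0, 80)))  # 'Qa' (40 overlapping chunks left)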
The elevator policy is a variant of attach that addresses its problems by enforcing a strict sequential reading order of the chunks for the entire system. This optimizes the disk latency and minimizes the number of I/O requests, and thus leads to good disk bandwidth and query throughput. However, the problem here is that query speed degenerates to the speed of the slowest query, because all queries wait for each other. Also, range queries often need to wait a long time before the reading cursor reaches the data that is interesting for them. In principle, in the worst case the number of I/Os performed by a system before a fresh query $Q_{new}$ finishes can be $MIN(C_T, C_{new} + \sum_{q \in queries} C_q)$, where $C_T$ is the number of chunks in the entire table.
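The worst-case bound itself is a one-liner; a sketch (ours, continuing the example above):

    def worst_case_ios_elevator(c_t, c_new, running_chunks):
        # the new query may wait for the shared cursor to serve all other
        # queries first, but never needs more than one pass over the table
        return min(c_t, c_new + sum(running_chunks))

    print(worst_case_ios_elevator(100, 30, [10, 50]))  # min(100, 90) = 90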
4. COOPERATIVE SCANS

The analysis in the previous section, further confirmed by results in Section 5, demonstrates that existing scan-processing solutions that try to improve over the normal policy still suffer from multiple inefficiencies. In this section we propose a new "Cooperative Scans" framework that avoids these problems. As Figure 1 presents, it consists of a cooperative variant of the traditional (index) Scan operator, named CScan, and an Active Buffer Manager (ABM).

The new CScan operator registers itself as an active scan on a range or a set of ranges from a table or a clustered index. CScan has much the same interface as the normal Scan operator, but it is willing to accept that data may come in a different order. Note that some query plans exploit column ordering present on disk; we discuss integration of such queries in our framework in Section 7.

The Active Buffer Manager (ABM) extends the traditional buffer manager in that it keeps track of CScan operators and which parts of the table are still needed by each of them, and tries to schedule disk reads such that multiple concurrent scans reuse the same pages. The overall goal of the ABM is to minimize average query cost, while keeping the maximum query execution cost reasonable (i.e. ensuring "fair" treatment of all queries). As discussed before, scan processing is usually performed with large I/O units we call chunks, to achieve good bandwidth with multiple concurrent queries. Note that a chunk in memory does not have to be contiguous, as it can consist of multiple pages filled in with a single scatter-gather I/O request. In our framework there are two more reasons for using chunks. First, the number of chunks is usually one or two orders of magnitude smaller than the number of pages, thus it becomes possible to have chunk-level scheduling policies that are considerably more complex than page-level policies. Secondly, chunks are logical entities whose boundaries may not even correspond exactly to page boundaries, a feature that will be exploited in the more complex scenarios with column-based storage.

In our system, the Cooperative Scans framework implements the traditional scan-processing policies: normal, attach and elevator. However, its main benefit comes from a newly introduced relevance policy that takes scheduling decisions using a set of relevance functions. Both the CScan and ABM processes, as well as the relevance functions used by them, are described in Figure 3.
                         CScan process

selectChunk(q_trigger)
| if finished(q_trigger)
| | return NULL
| else
| | if abmBlocked()
| | | signalQueryAvailable()
| | chunk = chooseAvailableChunk(q_trigger)
| | if (chunk == NULL)
| | | chunk = waitForChunk(q_trigger)
| | return chunk

chooseAvailableChunk(q_trigger)
| c_available = NULL, U = 0
| foreach c in interestingChunks(q_trigger)
| | if chunkReady(c) and useRelevance(c) > U
| | | U = useRelevance(c)
| | | c_available = c
| return c_available

                          ABM process

main()
| while (true)
| | query = chooseQueryToProcess()
| | if query == NULL
| | | blockForNextQuery()
| | | continue
| | chunk = chooseChunkToLoad(query)
| | slot = findFreeSlot(query)
| | loadChunk(chunk, slot)
| | foreach q in queries
| | | if (chunkInteresting(q, chunk) and queryBlocked(q))
| | | | signalQuery(q, chunk)

chooseQueryToProcess()
| relevance = −∞, query = NULL
| foreach q in queries
| | qr = queryRelevance(q)
| | if (query == NULL or qr > relevance)
| | | relevance = qr
| | | query = q
| return query

chooseChunkToLoad(q_trigger)
| c_load = NULL, L = 0
| foreach c in interestingChunks(q_trigger)
| | if (not chunkReady(c)) and loadRelevance(c) > L
| | | L = loadRelevance(c)
| | | c_load = c
| return c_load

findFreeSlot(q_trigger)
| s_evict = NULL, K = ∞
| foreach s in slots
| | if empty(s)
| | | return s
| | c = chunkInSlot(s)
| | if (not currentlyUsed(s)) and (not interesting(c, q_trigger))
| |     and (not usefulForStarvedQuery(c))
| |     and keepRelevance(c) < K
| | | K = keepRelevance(c)
| | | s_evict = s
| freeSlot(s_evict)
| return s_evict

                   NSM Relevance Functions

queryRelevance(q)
| if not queryStarved(q)
| | return −∞
| return -chunksNeeded(q) + waitingTime(q) / runningQueries()

useRelevance(c, q_trigger)
| return Q_max − numberInterestedQueries(c)

loadRelevance(c)
| return numberInterestedStarvedQueries(c) * Q_max
|   + numberInterestedQueries(c)

keepRelevance(c, q_trigger)
| return numberInterestedAlmostStarvedQueries(c) * Q_max
|   + numberInterestedQueries(c)

queryStarved(q_trigger)
| return numberOfAvailableChunks(q_trigger) < 2

Figure 3: Pseudo-code for the Relevance policy

As the top part of Figure 3 illustrates, the CScan process is called on behalf of a certain query, q_trigger, that contains a CScan operator in its query plan. Each time q_trigger needs a chunk of data to process, selectChunk is called. This triggers a search over all buffered chunks that still need to be processed by the query, in chooseAvailableChunk, and returns the most relevant one, as governed by useRelevance. If no such chunk is available, the operator blocks until the ABM process loads a chunk that is still needed by q_trigger. Our useRelevance function promotes chunks with the smallest number of interested queries. By doing so, the less interesting chunks will be consumed early, making it safe to evict them. This also minimizes the likelihood that less interesting chunks will get evicted before they are consumed.

The ABM thread continuously monitors all currently running queries and their data needs. It schedules I/O requests on behalf of the query with the highest priority, considering the current system state. For this query, it chooses the most relevant chunk to load, possibly evicting the least relevant
chunk present in the buffer manager. Once a new chunk is loaded into the ABM, all blocked queries interested in that chunk are notified. This is the core functionality of ABM's main loop, as found in the middle part of Figure 3.

The chooseQueryToProcess call is responsible for finding the highest-priority query, according to queryRelevance, to load a chunk for. This queryRelevance function considers all non-starved queries (i.e. queries that have 2 or more available chunks, including the one they are currently processing) equal, assigning them the lowest priority possible. Starved queries are prioritized according to the amount of data they still need, with shorter queries receiving higher priorities. However, to prevent the longer queries from being starved forever, the priorities are adjusted to also promote queries that have already been waiting for a long time. By prioritizing short queries, ABM tries to avoid situations where chunks are assigned to queries in a round-robin fashion, as this can have a negative impact on query latency. Besides, a chunk loaded for a short query has a higher chance of being useful to some large query than the other way around. In case ABM does not find a query to schedule a chunk for, it blocks in blockForNextQuery, until the CScan operator wakes it up again using signalQueryAvailable.

Once ABM has found a query to schedule a chunk for, it calls the chooseChunkToLoad routine to select a not-yet-loaded chunk that still needs to be processed by the selected query. The loadRelevance function determines which chunk will actually be loaded, not only by looking at what is relevant to the current query, but also by taking other queries' needs into consideration. To maximize sharing, it promotes chunks that are needed by the highest number of starved queries, while at the same time slightly adjusting priorities to prefer chunks needed by many non-starved queries.

If there are no free slots in the buffer pool, ABM's findFreeSlot routine needs to swap out the chunk with the lowest keepRelevance. This function is similar to loadRelevance, except that when looking at queries, we treat queries on the border of starvation as being starved, to avoid evicting their chunks, which would make them starved, hence schedulable, immediately.

The relevance policy tries to maximize buffer-pool reuse without slowing down fast queries. Thus, a slow query will re-use some of the chunks loaded by a fast query, skipping over chunks that it was too slow to process. These are read again later in the process, when the fast query might already be gone. The access pattern generated by this approach may be (quasi-) random, but since chunks consist of multiple sequential pages, disk (arm) latency is still well amortized.
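To make the relevance functions concrete, the following executable Python transcription (ours; a sketch over a toy dict-based query model, not the actual MonetDB/X100 code) mirrors the bottom part of Figure 3. The Q_MAX value and the "at most 2 available chunks" reading of "almost starved" are our assumptions:

    Q_MAX = 1000  # assumed constant exceeding the number of concurrent queries

    def query_starved(q):
        # a query is starved when fewer than 2 of its chunks are available
        return len(q["available_chunks"]) < 2

    def query_relevance(q, n_running):
        # non-starved queries get the lowest possible priority; starved ones
        # are ranked by remaining work, boosted by their waiting time
        if not query_starved(q):
            return float("-inf")
        return -len(q["needed_chunks"]) + q["waiting_time"] / n_running

    def interested(c, queries):
        return [q for q in queries if c in q["needed_chunks"]]

    def use_relevance(c, queries):
        # prefer consuming chunks few queries still want, so they can be evicted
        return Q_MAX - len(interested(c, queries))

    def load_relevance(c, queries):
        # prefer loading chunks wanted by many starved queries
        starved = [q for q in interested(c, queries) if query_starved(q)]
        return len(starved) * Q_MAX + len(interested(c, queries))

    def keep_relevance(c, queries):
        # like load_relevance, but queries on the border of starvation
        # (here: at most 2 available chunks) count as starved
        almost = [q for q in interested(c, queries)
                  if len(q["available_chunks"]) <= 2]
        return len(almost) * Q_MAX + len(interested(c, queries))

    # toy example: chunk 3 is wanted by one starved and one non-starved query
    q1 = {"needed_chunks": {1, 2, 3}, "available_chunks": [], "waiting_time": 4.0}
    q2 = {"needed_chunks": {3, 4}, "available_chunks": [3, 4], "waiting_time": 0.0}
    print(load_relevance(3, [q1, q2]))  # 1 * Q_MAX + 2 = 1002
    print(keep_relevance(3, [q1, q2]))  # both almost starved: 2 * Q_MAX + 2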
                                                                   concurrent scans, since not all queries can share the same
                                                                   chunk sequence. Additionally, we can see that a faster query
5. ROW-WISE EXPERIMENTS

5.1 Benchmark settings

Benchmark system: We carried out row-storage experiments on MonetDB/X100, using the PAX storage model, which is equivalent to NSM in terms of I/O demand. The chunk size used was 16MB, and the ABM buffer-pool size was set to 64 chunks (1GB), unless stated otherwise. Direct I/O was used, to avoid operating system buffering. Our test machine was a dual-CPU AMD Opteron 2GHz system with 4GB of RAM. The storage facility was a 4-way RAID system delivering slightly over 200 MB/s.

Benchmark dataset: We used the standard TPC-H [30] benchmark data with scale factor 10. In this setting the lineitem table consumes over 4GB of disk space. The other tables are fully cached by the system.

Queries: To allow flexible testing of our algorithms we have chosen two queries based on the TPC-H benchmark. Query FAST (F) is TPC-H Q6, which is a simple aggregation. Query SLOW (S) is TPC-H Q1 with extra arithmetic computations to make it more CPU intensive. For all queries we allow arbitrary scaling of the scanned table range. In this section we use the notation QUERY-PERCENTAGE, with QUERY representing the type of query, and PERCENTAGE the size of the range being scanned. For example, with F-10 we denote query FAST, reading 10% of the full relation from a random location. We use multiple query streams, each sequentially executing a random set of queries. There is a 3-second delay between starting the streams, to better simulate queries entering an already-working system.

5.2 Comparing scheduling policies

Table 2 shows the results for all scheduling policies when running 16 streams of 4 queries. We used a mix of slow and fast queries with selectivity of 1%, 10%, 50% and 100%. The two major system-wide results are the average stream running time, which represents system throughput, and the average normalized latency of a query (running time in this benchmark divided by the base time, when the query runs by itself with an empty buffer), which represents system latency. Additionally we provide the total execution time, CPU utilization, and the number of issued I/Os. The difference between the total time and the average stream time comes from the random distribution of queries in the streams, resulting in a significant variance of stream running times. For each query type and policy we provide the average latency, normalized latency and number of I/Os issued when scheduling this query type. Additionally, Figure 4 presents a detailed analysis of the I/O requests issued by each policy.
                              Normal    Attach   Elevator  Relevance
                               System statistics
Avg. stream time              283.72    160.81    138.41      99.55
Avg. normalized latency         6.42      3.72     13.52       1.96
Total time                    453.06    281.19    244.45     238.16
CPU use                       53.20%    81.31%    90.20%     93.94%
I/O requests                    4186      2325      1404       1842

                               Query statistics
                      Normal                Attach              Elevator       Relevance
query  count  stand-  latency(sec) norm I/Os  latency(sec) norm I/Os  latency(sec) norm   latency(sec) norm I/Os
              alone    avg  stddev lat.        avg  stddev lat.        avg  stddev lat.    avg  stddev lat.
              cold
F-01     9     0.26    1.71   1.02  6.58    2   1.02   0.49 3.92    2   5.31   7.33 20.42   0.52   0.36 2.00    2
F-10     7     2.06   13.97   5.69  6.78   23   6.23   2.56 3.02   18  15.17   8.63  7.36   3.30   1.30 1.60   18
F-50     6    10.72  103.59  14.96  9.66   78  58.77  10.96 5.48   67  44.87   7.92  4.19  18.21   6.64 1.70   43
F-100    9    20.37  192.82  31.56  9.47  153  96.98  23.33 4.76   69  59.60  19.57  2.93  29.01   8.17 1.42   69
S-01    13     0.38    1.67   1.25  4.39    2   1.19   0.65 3.13    3  15.01  15.04 39.50   0.55   0.29 1.45    2
S-10     6     3.55   21.58   5.11  6.08   19  15.12   4.08 4.26   24  20.29  23.93  5.72  11.30   5.98 3.18   22
S-50     6    17.73   78.23  29.07  4.41   95  46.98  16.82 2.65   79  37.39  14.23  2.11  37.77  15.66 2.13   48
S-100    8    35.27  179.35  59.04  5.09  177 105.51  33.40 2.99   60  79.39  24.37  2.25  98.71  29.89 2.80   44

Table 2: Row-storage experiments (PAX) with a set of FAST and SLOW queries scanning 1%, 10%, 50% and 100% of a table. 16 streams of 4 random queries. All times in seconds.
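As a quick consistency check on Table 2, each normalized latency is simply the measured average latency divided by the standalone cold time; for example, for F-01 under normal:

    \text{norm. lat.}(\text{F-01}, \textit{normal}) = \frac{1.71}{0.26} \approx 6.58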

[Figure 4: Behavior of different scheduling algorithms: disk accesses over time. One panel per policy (normal, attach, elevator, relevance), plotting the chunk number of each disk access against time (sec); the attach panel highlights a "detach" event.]

As expected, the normal policy achieves the worst performance. As Figure 4 shows, it maintains multiple concurrent sequential scans, which leads to the largest number of I/O requests and minimal buffer reuse. Since the query load is relatively CPU-intensive, it still manages to use a significant fraction of the CPU time.

The attach policy allows merging requests from some queries. As a result, it consistently improves the performance of all query types and the system throughput. Still, in Figure 4 we see that there are multiple (albeit fewer than in normal) concurrent scans, since not all queries can share the same chunk sequence. Additionally, we can see that a faster query can detach from a slower one (circled in the figure), resulting in a split of a reading sequence, further reducing performance.

The elevator policy shows a further reduction of the I/O requests and an improvement of system throughput. This is a result of its simple I/O pattern, seen in Figure 4. However, we see that the average normalized latency is very bad for this policy. This is caused by short queries that suffer from a long waiting time and achieve results even worse than in normal. This blocking of queries also degrades the overall system time, since it delays the start moment of the next query in a given stream. We also see that fast and slow queries differ little in performance; this is caused by the fast queries waiting for the slow ones.

Our new relevance policy achieves the best performance,

both in terms of global system parameters, as well as in most query times. As Figure 4 shows, its I/O request pattern is much more dynamic than in all the other policies. Interestingly, although relevance issues more I/Os than elevator, it still results in better throughput. This is because the system is mostly CPU-bound in this case, and extra available I/O time is efficiently used to satisfy further query requirements. Also, relevance differs from the other policies by significantly improving the performance of I/O-bound queries. Average query latency is three times better than normal and two times better than attach (I/O-bound queries like F-100 can even be three times faster than attach).

5.2.1 Exploring Many Different Query Mixes

To provide more than accidental evidence of the superior performance of relevance, we conducted experiments with the same basic settings as in the previous section: 16 streams of 4 queries, a TPC-H table with scale factor 10, and a buffer size of 1GB. However, we changed the set of queries to explore two dimensions: range size and data-processing speed. Figure 5 shows the results, where we compare throughput as the average stream running time (y axis) and average normalized query latency (x axis) of normal, attach and elevator, with respect to our relevance policy. Each point represents a single run for one policy. The point labels describe runs in a format "SPEED-SIZE", where SPEED defines what query speeds were used (FS - mix of fast and slow, F - only fast, FFS - 2 times more fast than slow, etc.), and SIZE represents the size of the range being scanned: S - short (mix of queries reading 1, 2, 5, 10 and 20% of a table), M - mixed (1, 2, 10, 50, 100) and L - long (10, 30, 50, 100).

From this experiment we conclude that indeed relevance, representing the (1,1) point in this scatter plot, is consistently better than all other policies. Recall that our objective was to find a scheduling policy that works well both on the query throughput and on the query latency dimension, and this is exactly what is shown in Figure 5. This scatter plot also allows us to better understand the other policies. We see that normal is inferior on both dimensions, whereas elevator gets close to relevance on throughput, but its performance is significantly hindered by poor query latencies. As for attach, it does find a balance between throughput and latency, but it is consistently beaten by relevance in both dimensions.
[Figure 5: Scatter plot of average stream time relative to relevance (y axis) versus average normalized query latency relative to relevance (x axis); one point per run of normal, attach and elevator, labeled SPEED-SIZE (e.g. F-M, F-L, FFS-S, FFS-M, FFS-L, SF-S, SF-M, SF-L, SSF-S, SSF-M, SSF-L, S-S, S-M, S-L).]

[Additional figure (truncated): number of I/O requests, system time (sec), and average normalized query latency for a CPU-intensive query set (S-01, S-10, S-50, S-100, F-01, F-10, F-50, F-100) and an I/O-intensive query set (F-01, F-10, F-50, F-100).]
                                                            FFS-L
                                                             F-S      SSF-L     SF-S                                                                                                                                                                          6
                                                                                                                               S-M                                                                           8
                                              F-L                 SSF-M                                                SSF-M                                                                                                                                  5
                                    1.2             S-L FFS-S
                                                                S-L
                                                                       S-S    SSF-S                                                                                                                          6                                                4
                                                        S-M
                                                                                                                                                                                                             4                                                3
                                                                                                                                                                                                                                                              2
                                     1                                                                                                                                                                       2                                                1
                                          1   1.2      1.5      2     2.5   3         4    5   6                               7                           8                                                 0                                                0
                                                                                                                                                                                                                    20      40         60    80      100            20           40         60      80       100
                                                    Average normalized query latency / RELEVANCE                                                                                                                           Buffer capacity                                 Buffer capacity
                                                                                                                                                                                                                    (percentage of the table size)                  (percentage of the table size)
                                                normal                             attach                elevator                                                                                                  normal               attach                    elevator                     relevance



Figure 5: Performance of various scheduling policies for query                                                                                                           Figure 6: Behavior of all scheduling poli-
sets varying in processing speed and scanned range                                                                                                                       cies under varying buffer pool capacities


5.2.2 Scaling The Data Volume

With growing dataset sizes, the percentage of a relation that can be stored inside the buffer pool decreases. To simulate this, we tested the performance of different policies under varying buffer capacities, ranging from 12.5% to 100% of the full table size. This allows us to observe how different scheduling policies would behave under growing relation sizes, when a smaller fraction of a table can be buffered. In this experiment we used a version of our relation trimmed down to 2GB, which can be fully cached in the memory of our benchmark machine. Figure 6 shows the results of a benchmark with two sets of queries: a disk-intensive one, consisting only of fast queries, and a CPU-intensive one, consisting of a mix of fast and slow queries. We used 8 streams of 4 queries.

As expected, the number of I/Os decreases with increasing buffer size. In the disk-intensive case, the absolute system performance is directly influenced by this number, because the system is never CPU-bound. Still, thanks to better request scheduling, the relevance policy manages to improve the performance, issuing significantly fewer I/O requests even when using an 87.5% buffer capacity. In the CPU-intensive scenario, the number of I/Os influences the absolute time only partially. This is because most algorithms manage to make the system CPU-bound at some buffer capacity. For relevance even a small buffer size of 12.5% of the full table is enough to achieve this, as we can see from its mostly constant performance.

Interestingly, Figure 6 shows that the performance advantages of relevance over the other policies observed in Figure 5 are maximized when tables get bigger (i.e. at low buffered percentages). Looking at attach, the most viable competitor, we see that both throughput in I/O-bound situations and latency in CPU-bound queries deteriorate strongly, and we expect the advantage of relevance to grow even more if table sizes become huge.

5.2.3 Many Concurrent Queries

With more concurrent queries the opportunities for data-reuse increase. In Figure 7 we present how the average query time changes when an increasing number of concurrent queries reads 5, 20 and 50% of our relation, using a buffer pool of 1GB. As expected, the benefit of relevance over normal grows with larger scans and more concurrency. We see that relevance also enhances its advantage over attach when more queries run concurrently, even exceeding the factor two observed in Figures 6 and 5, when scan ranges are highly or moderately selective. As this query set is uniform in terms of range sizes, elevator can score close to relevance, but we know from previous experiments that on combinations of short and long ranges it is not a viable competitor.

[Figure 7: Performance comparison with varying number of concurrent queries and scanned ranges. The original panels plot average query latency against the number of queries (1 to 32) for scans reading 5%, 20% and 50% of the relation, comparing the normal, attach, elevator and relevance policies.]

5.2.4 Scheduling-cost scalability

The cost of scheduling in relevance is significantly higher than for other policies. For example, the loadRelevance function needs to check every query for every table chunk, and do this for each chunk a query requests. Figure 8 presents the average times spent on scheduling when running 16 concurrent streams of 4 I/O-bound queries, each with the same relation stored in a different number of chunks of varying sizes. As expected, the overhead grows super-linearly: with smaller chunks, every query needs to scan more of them, and the decision process for each data request needs to consider more chunks. Still, even with the largest tested number of chunks, the scheduling overhead in the worst case does not exceed 1% of the entire execution time. In situations where such overhead is not acceptable, e.g. with relations consisting of hundreds of thousands of chunks, slightly less complex policies can be considered. Also, our relevance implementation is rather naive, leaving opportunities for optimizations that can significantly reduce the scheduling cost.
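To make the source of this cost concrete, the following minimal sketch (illustrative only; the names and structures are ours, not the actual implementation) shows why the per-request scheduling work is proportional to the number of chunks times the number of queries:

from dataclasses import dataclass, field

@dataclass
class Query:
    needed_chunks: set = field(default_factory=set)  # chunks still to scan

def load_relevance_naive(chunk, queries):
    # Inspect every active query for this one chunk; the real function
    # additionally divides by the number of pages left to load.
    return sum(1 for q in queries if chunk in q.needed_chunks)

def schedule_next(chunks, queries):
    # One full pass over all chunks per data request: O(C * Q) work.
    # Since every query issues one request per chunk it scans, total
    # scheduling cost grows super-linearly as chunks get smaller.
    return max(chunks, key=lambda c: load_relevance_naive(c, queries))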
[Figure 8: Scheduling time and fraction of execution time when querying 1%, 10% and 100% of a 2GB relation with varying chunk size / number. The original panels plot scheduling time (ms) and the fraction of the execution time spent on scheduling against the number of chunks (128 to 2048, chunk size = 2GB / number) for 1%, 10% and 100% scans.]
6. IMPROVING DSM SCANS

After our successful experiments with relevance in row storage, we now turn our attention to column stores. The decomposed storage model (DSM) has recently gained popularity for its reduced disk-bandwidth needs, faster query processing thanks to improved data locality [3], the possibility of vectorized processing [5, 4] and additional compression opportunities [33, 1]. While we will show that relevance can also be successful here, we first discuss why DSM is much more complex than NSM when it comes to scans in general and I/O scheduling in particular, and how this influenced our Cooperative Scans framework.

6.1 DSM Challenges

[Figure 9: Compressed column storage: more complex logical chunk – physical page relationships. The original diagram shows disk pages of TPC-H lineitem columns (orderkey, partkey, returnflag, extendedprice, comment) with encodings such as PFOR-DELTA(oid): 3 bit, PFOR(oid): 21 bit, PDICT(str): 2 bit, decimal: 64 bit and str: 256 bit, and the boundaries of chunks 0-3 cutting across page boundaries at different positions in each column.]

Table columns stored using DSM may differ among each other in physical data representation width, either because of the data types used, or because of compression. For example, Figure 9 depicts column storage of a part of the TPC-H lineitem table, with some columns compressed with techniques presented in [33]. This shows that we cannot assume a fixed number of tuples on a disk page, even within a single column. As a result, a chunk cannot consist of a fixed number of disk pages as in NSM. Instead, chunks are logical concepts, i.e. a horizontal partitioning of the table at the tuple granularity. For example, one may divide a table into conceptual chunks of 100,000 tuples, but it is also possible to use variable-size chunks, e.g. to make the chunk boundary always match some key boundary. This implies that chunk boundaries do not align with page boundaries. The underlying storage manager should provide an efficient means to tell which pages store data from a chunk. Depending on the physical data representation, a single logical chunk can consist of multiple physical pages, and a single physical page can contain data for multiple logical chunks.
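One plausible way to provide this mapping efficiently is a per-column prefix sum over page tuple counts, so that the pages overlapping a tuple range can be found by binary search. A minimal sketch under that assumption (the names are illustrative, not our storage manager's interface):

import bisect

def build_page_index(tuples_per_page):
    # tuples_per_page[i]: tuples stored on physical page i of one column
    # (varies per column with data width and compression).
    # Returns index[i] = id of the first tuple on page i.
    index, total = [], 0
    for n in tuples_per_page:
        index.append(total)
        total += n
    return index

def pages_for_chunk(index, first_tuple, last_tuple):
    # Pages of one column that overlap the logical chunk
    # [first_tuple, last_tuple]. Because chunk boundaries do not align
    # with page boundaries, the boundary pages may be shared with
    # neighboring chunks.
    first = bisect.bisect_right(index, first_tuple) - 1
    last = bisect.bisect_right(index, last_tuple) - 1
    return range(first, last + 1)

# Example: pages of 100, 80 and 120 tuples; tuples 150..249 span pages 1-2.
assert pages_for_chunk(build_page_index([100, 80, 120]), 150, 249) == range(1, 3)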
This logical-physical mismatch present in DSM becomes even more problematic when using large physical blocks of a fixed size for I/O, a technique introduced in NSM for good concurrent bandwidth. The first problem here is that when loading a block for one chunk, a potentially large amount of data from a neighboring chunk can be loaded at the same time. In NSM this does not occur, as chunks and blocks are equivalent. In DSM, however, ABM needs to take special care to minimize situations in which this extra data is evicted before it could be used by a different chunk. Also, keeping a full physical block in memory to provide it to another query in the near future may result in sub-optimal buffer usage. The second problem is the buffer demand: in NSM, for Q concurrent queries the system requires 2 ∗ Q blocks (factor 2 because of prefetching), which is usually acceptable, e.g. 512MB for 16 concurrent scans using 16MB chunks/blocks. In DSM, however, to process a set of rows, data from multiple columns needs to be delivered. While some optimizations are possible, e.g. performing selections on some columns early [13], in general all the columns used by a query need to be loaded before the processing starts. As a result, a separate block is needed for every column used in a query, increasing the buffer demand significantly for multi-column scans, e.g. to 4GB for 16 scans reading 8 columns each. This can be improved (in both models) by analyzing which pages from blocks have already been processed, and re-using them as soon as possible for I/Os of different queries (with scatter-gather I/O). Both problems demonstrate that in DSM the performance and resource demand can be significantly improved by making algorithms page-aware. However, such solutions significantly complicate the implementation, and our current system handles only chunk-level policies.
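The buffer-demand numbers quoted above follow from simple arithmetic; a small sketch reproducing them:

MB, GB = 2**20, 2**30
BLOCK = 16 * MB     # fixed-size physical I/O block
PREFETCH = 2        # current block plus one prefetched block

def nsm_buffer_demand(n_queries):
    # NSM: one block stream per query.
    return n_queries * PREFETCH * BLOCK

def dsm_buffer_demand(n_queries, n_columns):
    # DSM: one block stream per (query, column) pair.
    return n_queries * n_columns * PREFETCH * BLOCK

assert nsm_buffer_demand(16) == 512 * MB       # 16 scans: 512MB
assert dsm_buffer_demand(16, 8) == 4 * GB      # 16 scans x 8 columns: 4GB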
The final DSM problem for Cooperative Scans is the reduced data-reuse opportunity between queries. Figure 10 shows two queries reading a subset of a table and their I/O requirements in both NSM and DSM.

[Figure 10: Data overlapping in NSM and DSM. The original diagram shows, for two queries over columns a-h, the overlapping values at the logical level and the corresponding I/O units in NSM and DSM.]

Comparing the logical data need to the physical data demand in both models, we see that the vertical expansion present in DSM is usually significantly smaller than the horizontal expansion present in NSM. As a result, fewer disk blocks are shared between the queries, reducing the chance of reusing the same block for different scans. In NSM, for a block fetched by some query Q1, the probability that another query Q2, reading T2 tuples, will use it is proportional to T2/TT, where TT is the number of tuples in the entire table. In DSM, we need to take into account both vertical (as in NSM) and horizontal overlap, reducing this probability to (T2/TT) ∗ Poverlap(Q1, Q2), where Poverlap(Q1, Q2) is the probability of a column from Q1 also being used in Q2.
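These two formulas transcribe directly; a numeric example with assumed parameters (a 10% scan, half of Q1's columns also read by Q2):

def p_reuse_nsm(t2, t_total):
    # NSM: probability that a block fetched for Q1 is also used by Q2.
    return t2 / t_total

def p_reuse_dsm(t2, t_total, p_column_overlap):
    # DSM adds the horizontal dimension: the fetched block's column
    # must additionally be one that Q2 reads.
    return (t2 / t_total) * p_column_overlap

assert p_reuse_nsm(6_000_000, 60_000_000) == 0.1
assert p_reuse_dsm(6_000_000, 60_000_000, 0.5) == 0.05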
useRelevance(c, qtrigger)
| cols = queryColumns(qtrigger)
| U = |interestedOverlappingQueries(c, cols)|
| Pu = numberCachedPages(c, cols)
| return Pu / U

loadRelevance(c, qtrigger)
| query_cols = queryColumns(qtrigger)
| queries = overlappingStarvedQueries(c, query_cols)
| cols = columnsUsedInQueries(queries)
| L = |queries|
| Pl = |columnPagesToLoad(c, cols)|
| return L / Pl

keepRelevance(c, qtrigger)
| starved = almostStarvedQueries(c)
| cols = columnsUsedInQueries(starved)
| E = |starved|
| Pe = |cachedColumnPages(c, cols)|
| return E / Pe

Figure 11: DSM Relevance Functions
6.2 Cooperative Scans in DSM

The DSM implementation of the traditional policies is straightforward. In normal, the order of I/Os is strictly determined by the query, and LRU buffering is performed on a (chunk, column) level. DSM attach joins a query with the running scan it overlaps most, where a crude measure of overlap is the number of columns two queries have in common. A more fine-grained measure would be to get average page-per-chunk statistics for the columns of a table, and use these as weights when counting overlapping columns (see the sketch below). Just like in NSM, the DSM elevator policy still enforces a global cursor that sequentially moves through the table. Obviously, it only loads the union of all columns needed at this position by the active queries.
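Such a weighted overlap measure could be sketched as follows (illustrative only; the per-column page statistics and scan objects are assumed):

def weighted_overlap(cols_a, cols_b, pages_per_chunk):
    # Crude measure: |shared columns|. Weighted measure: each shared
    # column counts by its average pages per chunk, so that wide or
    # poorly compressed columns represent more shareable I/O.
    return sum(pages_per_chunk[c] for c in set(cols_a) & set(cols_b))

def pick_attach_target(new_cols, running_scans, pages_per_chunk):
    # Attach the incoming query to the running scan with the largest
    # weighted column overlap.
    return max(running_scans,
               key=lambda s: weighted_overlap(new_cols, s.cols, pages_per_chunk))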
The framework for relevance in DSM is similar to that in NSM, with a few crucial differences, caused by the challenges discussed in the previous section:

avoiding data waste – as discussed, with I/O based on large physical blocks, it is possible that a block loaded for one logical chunk contains data useful for neighboring chunks. When the first chunk is freed, this data would be evicted. To avoid that, the choice of the next chunk for a given query is performed before the query blocks to wait for a fully available chunk. The already-loaded part of the chunk is marked as used, which prohibits its eviction.

finding space for a chunk – in DSM it is possible that a subset of columns in a buffered chunk is not useful for any query. ABM first evicts blocks belonging to such columns. Then, it starts evicting useful chunks, using the keepRelevance function to victimize the least relevant chunk. Note that, unlike in NSM, this eviction process is iterative, since due to different physical chunk sizes, possibly multiple chunks need to be freed.(1)

column-aware relevance functions – Figure 11 shows that the DSM relevance functions need to take into account the two-dimensional nature of column storage and the varying physical chunk sizes. Like in NSM, useRelevance attempts to use chunks needed by few queries, to make them available for eviction. However, it also analyzes the size of a chunk, to additionally promote chunks occupying more buffer space. The loadRelevance function looks at the number of starved queries that overlap with the triggering query and are interested in a given chunk. It also estimates the cost of loading a given chunk by computing the number of cold pages required for all needed columns. The returned score promotes chunks that benefit multiple starved queries and require a small amount of I/O. The DSM keepRelevance function promotes keeping chunks that occupy little space in the buffer pool and are useful for many queries.

column loading order – a final issue in DSM is the order of columns when loading a chosen chunk. If some queries depend only on a subset of the columns, it may be beneficial to load that subset first. Our current crude approach is to load column chunks in order of increasing size (in terms of pages), which maximizes the number of “early” available columns, allowing queries to be awoken earlier (this ordering is sketched below). Another approach could prioritize columns that satisfy the needs of some queries sooner. Finally, if data is compressed on disk but kept decompressed in the buffer manager (like in SybaseIQ), it might be valuable to first load compressed columns, so that their decompression is interleaved with loading the remaining ones.
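The increasing-size order amounts to a single sort; a minimal sketch (pages_to_load is an assumed per-column count of cold pages in the chosen chunk):

def column_load_order(columns, pages_to_load):
    # Load the cheapest column chunks first: more columns become fully
    # available early, so queries that need only a subset of the
    # columns can be woken up sooner.
    return sorted(columns, key=lambda col: pages_to_load[col])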
6.3 DSM Results

Table 3 presents DSM results for an experiment similar to the one presented in Table 2 for NSM/PAX. One difference is that we used a faster “slow” query, since, due to the faster scanning achieved by DSM, with the original query the system was completely CPU-bound, making it impossible to demonstrate performance differences of different policies with these queries. Also, we increased the lineitem size from factor 10 (~60M tuples) to 40 (~240M tuples), to compensate for the lower data-volume demand of DSM. Finally, since our current implementation requires reserved memory for each active chunk, and there are more chunks in DSM (one for each column), we had to increase the buffer size from 1GB to 1.5GB to allow concurrent execution of 16 queries.

(1) If multiple chunks need to be freed, the dependency between them should be taken into account, something missed by the greedy iterative approach used here. Choosing the optimal set of chunks to free is a good example of a knapsack problem surfacing in DSM I/O scheduling.
genpaxospublic-090703114743-phpapp01.pdfHiroshi Ono
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfHiroshi Ono
 
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfpragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfHiroshi Ono
 
downey08semaphores.pdf
downey08semaphores.pdfdowney08semaphores.pdf
downey08semaphores.pdfHiroshi Ono
 
BOF1-Scala02.pdf
BOF1-Scala02.pdfBOF1-Scala02.pdf
BOF1-Scala02.pdfHiroshi Ono
 
TwitterOct2008.pdf
TwitterOct2008.pdfTwitterOct2008.pdf
TwitterOct2008.pdfHiroshi Ono
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfHiroshi Ono
 
SACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfSACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfHiroshi Ono
 
scalaliftoff2009.pdf
scalaliftoff2009.pdfscalaliftoff2009.pdf
scalaliftoff2009.pdfHiroshi Ono
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfHiroshi Ono
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdfHiroshi Ono
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdfHiroshi Ono
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfHiroshi Ono
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfHiroshi Ono
 

Plus de Hiroshi Ono (20)

Voltdb - wikipedia
Voltdb - wikipediaVoltdb - wikipedia
Voltdb - wikipedia
 
Gamecenter概説
Gamecenter概説Gamecenter概説
Gamecenter概説
 
EventDrivenArchitecture
EventDrivenArchitectureEventDrivenArchitecture
EventDrivenArchitecture
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
 
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfpragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
 
downey08semaphores.pdf
downey08semaphores.pdfdowney08semaphores.pdf
downey08semaphores.pdf
 
BOF1-Scala02.pdf
BOF1-Scala02.pdfBOF1-Scala02.pdf
BOF1-Scala02.pdf
 
TwitterOct2008.pdf
TwitterOct2008.pdfTwitterOct2008.pdf
TwitterOct2008.pdf
 
camel-scala.pdf
camel-scala.pdfcamel-scala.pdf
camel-scala.pdf
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
 
SACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfSACSIS2009_TCP.pdf
SACSIS2009_TCP.pdf
 
scalaliftoff2009.pdf
scalaliftoff2009.pdfscalaliftoff2009.pdf
scalaliftoff2009.pdf
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
 

Dernier

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 

Dernier (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 

p723-zukowski

  • 1. Historically, disk scans were considered trivial, and simple LRU or MRU buffering policies were proposed for them [6, 23]. We show that if scans start at different times, these policies achieve only a low amount of buffer reuse. To improve this situation, some systems support the concept of circular scans [9, 10, 20, 14], which allows queries that start later to attach themselves to already active scans. As a result, the disk bandwidth can be shared between the queries, resulting in a reduced number of I/O requests. However, this strategy is not efficient when queries process data at different speeds or when a query scans only a range instead of a full table. In this paper we analyze the performance of existing scan strategies, identifying three basic approaches: normal, attach and elevator.

The ABM dynamically optimizes the order of disk accesses, taking into account all current CScan requests on a relation (or a set of clustered relations). This framework can run the basic normal, attach and elevator policies, but also a new policy, relevance, that is central to our proposal. Besides optimizing throughput, the relevance policy also minimizes latency. This is done by departing from the strictly sequential access pattern as present in attach and elevator. Instead, relevance makes page load and eviction decisions based on per-page relevance functions, which, for example, try to evict pages with a low number of interested queries as soon as possible, while prioritizing page reads for short queries and pages that have many interested queries.

To further illustrate the need for a more flexible approach to I/O scheduling, consider the following example. Assume a system has to execute two queries, Q1 and Q2, which enter the system at the same time and process data at the same speed. Q1 needs to read 30 pages and is scheduled first, while Q2 needs 10 different pages. If those queries get serviced in a round-robin fashion, as in the normal policy, Q2 finishes after 20 pages are loaded, and Q1 after 40, giving an average query latency of 30. The elevator policy may perform better, by first fully servicing Q2 and then Q1, reducing the average waiting time from 30 to 25. Still, elevator can choose the opposite order, resulting in waiting times of 30 and 40, hence actually increasing the average time. With relevance, we aim to get close to the optimal average query latency, without relying on the sequential scan order, by making flexible I/O scheduling decisions.
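The arithmetic in this example is easy to verify with a few lines of code. The sketch below is illustrative only (the helper functions are not from the paper); it assumes one page load per time unit and no data sharing between the two queries:

# Sketch: average query latency under round-robin vs. sequential servicing,
# assuming one page load per time unit and no shared pages between queries.

def round_robin_latencies(demands):
    # Load one page per query per round until each query finishes.
    remaining = dict(demands)
    finished, t = {}, 0
    while remaining:
        for q in list(remaining):
            t += 1
            remaining[q] -= 1
            if remaining[q] == 0:
                finished[q] = t
                del remaining[q]
    return finished

def sequential_latencies(order, demands):
    # Service queries one after another, as elevator may do.
    finished, t = {}, 0
    for q in order:
        t += demands[q]
        finished[q] = t
    return finished

demands = {"Q1": 30, "Q2": 10}                      # Q1 is scheduled first
print(round_robin_latencies(demands))               # {'Q2': 20, 'Q1': 40} -> avg 30
print(sequential_latencies(["Q2", "Q1"], demands))  # {'Q2': 10, 'Q1': 40} -> avg 25
print(sequential_latencies(["Q1", "Q2"], demands))  # {'Q1': 30, 'Q2': 40} -> avg 35

The three printed results correspond to the three cases in the example: round-robin servicing, the favorable elevator order, and the unfavorable elevator order.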
  • 2. Contributions. We view our contributions as follows: (i) the definition of the Cooperative Scans framework, and the new relevance policy; (ii) experiments that help better understand the behavior of the existing normal, attach and elevator scheduling policies, and also show that the relevance policy outperforms them in a wide range of scenarios, both for row- and column-stores; (iii) showing that I/O scheduling for column-stores is significantly more complex than for row-storage, a characteristic that so far seems to have been overlooked.

Outline. In Section 2 we motivate our research, arguing that efficiently handling (index) scans is an important topic and will become even more so in the future, given current hardware and application trends. Section 3 analyzes existing approaches to scan processing. In Section 4 we introduce the Cooperative Scans framework for row stores, and we validate its performance in Section 5. In Section 6 we extend Cooperative Scans to column stores, which recently gained popularity. The incorporation of ABM into an existing DBMS is discussed in Section 7, where we also explore possibilities of adapting order-aware query processing operators to handle out-of-order data delivery. We discuss related work in Section 8 before concluding in Section 9.

2. MOTIVATION

Database systems are addicted to random disk I/O, caused by unclustered index lookups, and hardware trends are pushing this model to the limit of sustainability. Foreign-key joins, as well as selection predicates executed using unclustered indices, may yield large streams of row-IDs for looking up records in a target table. If this target table is large and the accesses are scattered, the needed disk pages will have to be fetched using random disk I/O. To optimize performance, industry-strength RDBMSs make good use of asynchronous I/O to farm out batches of requests over multiple disk drives, both to achieve I/O parallelism between the drives, and to let each disk handle multiple I/Os in a single arm movement (amortizing some access latency).

  processing                      RAM    disks                  throughput
  #  CPU                                 #    totsize  cost     single  5-way
  4  Xeon 3.0GHz dual-core       64GB   124   4.4TB    47%     19497   10404
  2  Opteron 2GHz                48GB   336   6.0TB    80%     12941   11531
  4  Xeon 3.0GHz dual-core       32GB    92   3.2TB    67%     11423    6768
  2  Power5 1.65GHz dual-core    32GB    45   1.6TB    65%      8415    4802

Table 1: Official 2006 TPC-H 100GB results ("cost" is the disk subsystem's share of total system cost; throughput is reported for a single query stream and for 5 concurrent streams)

Massive unclustered disk access occurs frequently in benchmarks like TPC-H, and it is not uncommon now to see benchmark configurations that use hundreds or thousands of disks. For example, Table 1 shows that the four most recent TPC-H submissions of even the smallest 100GB data size used an average of 150 disks, with a total storage capacity of 3.8 terabytes. All these disks are less than 10% full, and the main reason for their high number is to get more disk arms, allowing for a higher throughput of random I/O requests. Note from Table 1 that a high number of disks seems especially crucial in the concurrent (5 stream) query scenario.

The underlying hardware trends of the past decades, namely sustained exponential growth in CPU power as well as much faster improvement in disk bandwidth than in I/O latency, are expected to continue. Thus, to keep each next CPU generation with more cores busy, the number of disks will need to be doubled to achieve system balance. This exponential trend is clearly unsustainable, and one can argue that in the real world (i.e. outside manufacturer benchmarking projects) it is already no longer being sustained, and database servers are often configured with fewer disks than optimal. The main reason for this is cost, both in terms of the absolute value of large I/O subsystems (nowadays taking more than two thirds of TPC-H benchmark system cost, see Table 1), but also maintenance costs. In a multi-thousand disk configuration, multiple disks are expected to break each day [25], which implies the need for full-time attendance by a human system administrator.

Better Scanning. We argue that the only way to avoid random I/O is to rely more on (clustered index) scans, which depend on sequential disk bandwidth rather than latency. Modern data warehousing systems try to achieve this by:
(1) storing relations redundantly in multiple orders [2], such that more query patterns can use a clustered access path. To avoid the costs of updating multiple such tables, updates in such systems are buffered in RAM in differential lists and are dynamically merged with persistent data.
(2) exploiting correlated orderings. In MonetDB/X100, a min- and max-value is kept for each column per large disk block. Such meta-data, similar to "small materialized aggregates" [19] and also found e.g. in the Netezza system as "zonemaps" [21], allows avoiding reading unneeded blocks during an (index) scan, even if the data is not ordered on that column, but on a correlated column. For example, in the lineitem table of the TPC-H schema it allows avoiding I/O for almost all non-relevant tuples in range selections on any date column, as dates in the fact tables of a data warehouse tend to be highly correlated. This technique can sometimes result in a scan-plan that requires a set of non-contiguous table ranges (see the sketch after this list).
(3) using multi-table clustering or materialized views, to exploit index range-scans even over foreign-key joins.
(4) exploiting large RAMs to fully buffer small (compressed) relations, e.g. the dimensions of a star schema.
(5) reducing scan I/O volume by offering column storage (DSM) as an option [33, 28] to avoid reading unused columns. The same remark as made in (1) on handling updates applies here. However, I/O scheduling in column stores can be significantly more complex, as shown in Section 6.
(6) using lightweight compression, where the reduced I/O cost due to size reduction outweighs the CPU cost of decompression. It has been shown that with column storage, (de-)compression becomes less CPU intensive and achieves better compression ratios [33, 1].
(7) using on-disk processing for reducing disk-memory traffic and main-CPU load. For example, Netezza uses programmable FPGAs [21] to quickly perform selection on data before it gets to the main memory.

After applying (1-2), data warehousing queries use (clustered index) scans for their I/O, typically selecting ranges from the fact tables. Other table I/O can be reduced using (3-4), and I/O bandwidth can be optimized to its minimum cost by (5-7). The challenge addressed here is that if this mode of operation is to be successful, the DBMS must be capable of handling many concurrent scan queries efficiently.
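As a concrete illustration of technique (2), the sketch below shows how per-block min/max summaries can prune a range scan. The data layout and function names are illustrative assumptions, not the actual MonetDB/X100 or Netezza interfaces:

# Sketch (illustrative): per-block min/max metadata lets a scan skip blocks
# that cannot contain matching tuples, even when the table is ordered on a
# column merely correlated with the predicate column.

from dataclasses import dataclass

@dataclass
class BlockMeta:
    block_id: int
    min_val: str   # minimum of the predicate column within this block
    max_val: str   # maximum of the predicate column within this block

def blocks_to_scan(meta, lo, hi):
    # Keep blocks whose [min, max] interval intersects the predicate [lo, hi].
    return [m.block_id for m in meta if m.max_val >= lo and m.min_val <= hi]

# A date-range selection on lineitem touches only a few blocks, because
# dates in a fact table correlate strongly with the physical order:
meta = [BlockMeta(0, "1994-01-01", "1994-06-30"),
        BlockMeta(1, "1994-07-01", "1994-12-31"),
        BlockMeta(2, "1995-01-01", "1995-06-30")]
print(blocks_to_scan(meta, "1994-08-01", "1994-09-30"))  # -> [1]

Note that the surviving blocks need not be adjacent, which is exactly why such a scan-plan can consist of a set of non-contiguous table ranges.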
  • 3. However, unlike asynchronous random I/O, where concurrent batches of random I/Os may allow for better amortized disk seeks thanks to request reordering, the naive approach to concurrent scans causes queries to fight for sequential read bandwidth, reducing overall throughput. We should note that we do not focus on overall query throughput alone as our efficiency metric: it is equally important that average query latency is minimized. The rationale here is that in many scenarios, for example when an application submits queries one after another, query latencies significantly influence the execution time of a query stream and hence the observed system throughput.

Our work exploits the observation that concurrent scans have a high expected amount of overlapping data need. This creates the opportunity for synchronizing queries, such that they share buffered data and thus reduce the demand for disk bandwidth. In the following we discuss how existing systems exploit this opportunity and propose a new technique that further improves bandwidth sharing.

3. TRADITIONAL SCAN PROCESSING

With multiple scans running concurrently, in a naive implementation sequential requests from different queries can interleave, causing frequent disk-arm movements and resulting in a semi-random access pattern and low overall disk throughput. To avoid this problem, most database systems execute such scans using large isolated I/O requests spanning multiple pages, together with physical clustering of table pages. As a result, the overhead of shifting the disk arm is amortized over a large chunk of data, resulting in an overall bandwidth comparable to a standalone scan.

Even when using bandwidth-efficient chunk-based I/O, different scheduling policies are used for concurrent scans. The most naive, called normal in the rest of the paper, performs scans by simply reading all disk blocks requested by a query in a sequential fashion, using an LRU policy for buffering. The disadvantage of LRU is that if one query starts too long after the other, the loaded pages will already be swapped out before they can be reused. As a result, assuming there is no buffer reuse between the queries, and queries are serviced in a round-robin fashion, the expected number of I/Os performed in the system until a new query Qnew that reads Cnew chunks finishes can be estimated by:

  Cnew + Σ_{q ∈ queries} MIN(Cnew, Cq)

The major drawback of the normal policy is that it does not try to reuse data shared by different running queries. In a dynamic environment, with multiple partial scans running at the same time, it is likely that the buffer pool contains some data that is useful for a given query. With a table consisting of CT chunks, a query that needs CQ chunks and a buffer pool of CB chunks, the probability of finding some useful data in the randomly-filled buffer is:

  Preuse = 1 − Π_{i=0}^{CB−1} (CT − CQ − i) / (CT − i)    (1)

[Figure 2: Probability of finding a useful chunk in a randomly-filled buffer pool, with varying buffer pool size and query demand]

As Figure 2 shows, even for small scanned ranges and buffer sizes, this probability can be high, e.g. over 50% for a 10% scan with a buffer pool holding 10% of the relation. Unfortunately, the normal policy, by enforcing a sequential order of data delivery, can use only a single page at a given time, reducing this probability to CB/CT.

In many cases it is possible to relax the requirement of sequential data delivery imposed by normal. Even when using a clustered index for attribute selection, consuming operators often do not need data in a particular order. This allows for scheduling policies with "out-of-order" data delivery.

A simple idea of sharing disk access between overlapping queries is used in the attach strategy. When a query Qnew enters the system, it looks at all other running scans, and if one of them (Qold) is overlapping, it starts to read data at the current position of Qold. To optimize performance, attach should choose the query that has the largest remaining overlap with Qnew. Once Qnew reaches the end of its desired range, it starts from the beginning until reaching the original position. This policy, also known as "circular scans" or "shared scans", is used among others in Microsoft SQLServer [10], RedBrick [9], and Teradata [20], and allows significant performance improvement in many scenarios. The attach policy, however, may suffer from three problems. First, if one query moves much faster than the other, the gap between them may become so large that pages read by the fastest query are swapped out before the slower one reaches them (they "detach"). Second, if queries are range scans, it is possible that one of the queries that process data together finishes, and the other continues by itself, even though it could attach to another running query. For example, if a full scan is underway but not yet in its range, attach misses this sharing opportunity. Finally, when exploiting per-block meta-data, the scan request can consist of multiple ranges, making it even harder to benefit from sharing a scan with a single query. As a result, the upper bound on the number of I/Os performed by attach is the same as in normal.

The elevator policy is a variant of attach that addresses its problems by enforcing a strict sequential reading order of the chunks for the entire system. This optimizes the disk latency and minimizes the number of I/O requests, and thus leads to good disk bandwidth and query throughput. However, the problem here is that query speed degenerates to the speed of the slowest query, because all queries wait for each other. Also, range queries often need to wait a long time before the reading cursor reaches the data that is interesting for them. In principle, in the worst case the number of I/Os performed by the system before a fresh query Qnew finishes can be MIN(CT, Cnew + Σ_{q ∈ queries} Cq), where CT is the number of chunks in the entire table.
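Equation (1) can be checked numerically. The following minimal sketch reproduces the claim made for Figure 2 (a 10% scan against a buffer pool holding 10% of a 100-chunk table):

# Sketch: evaluate equation (1), the chance that a randomly-filled buffer
# pool contains at least one chunk useful to a query.
from fractions import Fraction

def p_reuse(c_t, c_q, c_b):
    # c_t: chunks in the table, c_q: chunks the query needs,
    # c_b: chunks held in the buffer pool.
    p_no_useful_chunk = Fraction(1)
    for i in range(c_b):
        p_no_useful_chunk *= Fraction(c_t - c_q - i, c_t - i)
    return 1 - p_no_useful_chunk

print(float(p_reuse(100, 10, 10)))  # ~0.67, i.e. over 50% as stated above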
  • 4. 4. COOPERATIVE SCANS

The analysis in the previous section, further confirmed by the results in Section 5, demonstrates that existing scan-processing solutions that try to improve over the normal policy still suffer from multiple inefficiencies. In this section we propose a new "Cooperative Scans" framework that avoids these problems. As Figure 1 presents, it consists of a cooperative variant of the traditional (index) Scan operator, named CScan, and an Active Buffer Manager (ABM).

The new CScan operator registers itself as an active scan on a range or a set of ranges from a table or a clustered index. CScan has much the same interface as the normal Scan operator, but it is willing to accept that data may come in a different order. Note that some query plans exploit column ordering present on disk; we discuss the integration of such queries in our framework in Section 7.

The Active Buffer Manager (ABM) extends the traditional buffer manager in that it keeps track of CScan operators and which parts of the table are still needed by each of them, and tries to schedule disk reads such that multiple concurrent scans reuse the same pages. The overall goal of the ABM is to minimize average query cost, keeping the maximum query execution cost reasonable (i.e. ensuring "fair" treatment of all queries). As discussed before, scan processing is usually performed with large I/O units we call chunks, to achieve good bandwidth with multiple concurrent queries. Note that a chunk in memory does not have to be contiguous, as it can consist of multiple pages filled in with a single scatter-gather I/O request. In our framework there are two more reasons for using chunks. First, the number of chunks is usually one or two orders of magnitude smaller than the number of pages, thus it becomes possible to have chunk-level scheduling policies that are considerably more complex than page-level policies. Secondly, chunks are logical entities whose boundaries may not even correspond exactly to page boundaries, a feature that will be exploited in the more complex scenarios with column-based storage.

In our system, the Cooperative Scans framework implements the traditional scan-processing policies: normal, attach and elevator. However, its main benefit comes from a newly introduced relevance policy that takes scheduling decisions by using a set of relevance functions. Both the CScan and ABM processes, as well as the relevance functions used by them, are described in Figure 3:

CScan process

selectChunk(qtrigger)
| if finished(qtrigger)
| | return NULL
| else
| | if abmBlocked()
| | | signalQueryAvailable()
| | chunk = chooseAvailableChunk(qtrigger)
| | if (chunk == NULL)
| | | chunk = waitForChunk(qtrigger)
| | return chunk

chooseAvailableChunk(qtrigger)
| cavailable = NULL, U = 0
| foreach c in interestingChunks(qtrigger)
| | if chunkReady(c) and useRelevance(c) > U
| | | U = useRelevance(c)
| | | cavailable = c
| return cavailable

ABM process

main()
| while (true)
| | query = chooseQueryToProcess()
| | if query == NULL
| | | blockForNextQuery()
| | | continue
| | chunk = chooseChunkToLoad(query)
| | slot = findFreeSlot(query)
| | loadChunk(chunk, slot)
| | foreach q in queries
| | | if chunkInteresting(q, chunk) and queryBlocked(q)
| | | | signalQuery(q, chunk)

chooseQueryToProcess()
| relevance = -∞, query = NULL
| foreach q in queries
| | qr = queryRelevance(q)
| | if (query == NULL or qr > relevance)
| | | relevance = qr
| | | query = q
| return query

chooseChunkToLoad(qtrigger)
| cload = NULL, L = 0
| foreach c in interestingChunks(qtrigger)
| | if (not chunkReady(c)) and loadRelevance(c) > L
| | | L = loadRelevance(c)
| | | cload = c
| return cload

findFreeSlot(qtrigger)
| sevict = NULL, K = ∞
| foreach s in slots
| | if empty(s)
| | | return s
| | c = chunkInSlot(s)
| | if (not currentlyUsed(s)) and (not interesting(c, qtrigger))
| |    and (not usefulForStarvedQuery(c)) and keepRelevance(c) < K
| | | K = keepRelevance(c)
| | | sevict = s
| freeSlot(sevict)
| return sevict

NSM Relevance Functions

queryRelevance(q)
| if not queryStarved(q)
| | return -∞
| return -chunksNeeded(q) + waitingTime(q) / runningQueries()

useRelevance(c, qtrigger)
| return Qmax - numberInterestedQueries(c)

loadRelevance(c)
| return numberInterestedStarvedQueries(c) * Qmax + numberInterestedQueries(c)

keepRelevance(c, qtrigger)
| return numberInterestedAlmostStarvedQueries(c) * Qmax + numberInterestedQueries(c)

queryStarved(qtrigger)
| return numberOfAvailableChunks(qtrigger) < 2

Figure 3: Pseudo-code for the Relevance policy

As the top part of Figure 3 illustrates, the CScan process is called on behalf of a certain query, qtrigger, that contains a CScan operator in its query plan. Each time qtrigger needs a chunk of data to process, selectChunk is called. This triggers a search over all buffered chunks that still need to be processed by the query, in chooseAvailableChunk, and returns the most relevant one, as governed by useRelevance. If no such chunk is available, the operator blocks until the ABM process loads a chunk that is still needed by qtrigger. Our useRelevance function promotes chunks with the smallest number of interested queries. By doing so, the less interesting chunks will be consumed early, making it safe to evict them. This also minimizes the likelihood that less interesting chunks will get evicted before they are consumed.

The ABM thread continuously monitors all currently running queries and their data needs. It schedules I/O requests on behalf of the query with the highest priority, considering the current system state. For this query, it chooses the most relevant chunk to load, possibly evicting the least relevant chunk present in the buffer manager.
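The relevance functions at the bottom of Figure 3 translate almost directly into executable form. The sketch below uses a toy state model (the Query class and counter arguments are illustrative assumptions, not the MonetDB/X100 implementation; Q_MAX stands for the Qmax constant, an upper bound on concurrent queries):

# Sketch of the NSM relevance functions from Figure 3 over a toy model.

Q_MAX = 64  # upper bound on concurrent queries (the Qmax constant)

class Query:
    def __init__(self, needed, available, waiting):
        self.needed = needed        # chunks the query still has to process
        self.available = available  # how many of those are currently buffered
        self.waiting = waiting      # time the query has waited for I/O

def query_starved(q):
    # A query is starved if fewer than 2 of its chunks are available.
    return q.available < 2

def query_relevance(q, running_queries):
    if not query_starved(q):
        return float("-inf")  # non-starved queries get the lowest priority
    # Short queries first, aged by waiting time so long queries cannot starve.
    return -q.needed + q.waiting / running_queries

def use_relevance(num_interested):
    # Consume the least-shared buffered chunks first: safe to evict soon after.
    return Q_MAX - num_interested

def load_relevance(num_interested, num_interested_starved):
    # Prefer chunks wanted by many starved queries; tie-break on total interest.
    return num_interested_starved * Q_MAX + num_interested

def keep_relevance(num_interested, num_interested_almost_starved):
    # As load_relevance, but queries on the border of starvation count as
    # starved, so that their chunks are not the ones evicted.
    return num_interested_almost_starved * Q_MAX + num_interested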
  • 5. Once a new chunk is loaded into the ABM, all blocked queries interested in that chunk are notified. This is the core functionality of ABM's main loop, as found in the middle part of Figure 3.

The chooseQueryToProcess call is responsible for finding the highest-priority query, according to queryRelevance, to load a chunk for. This queryRelevance function considers all non-starved queries (i.e. queries that have 2 or more available chunks, including the one they are currently processing) equal, assigning them the lowest priority possible. Starved queries are prioritized according to the amount of data they still need, with shorter queries receiving higher priorities. However, to prevent the longer queries from being starved forever, the priorities are adjusted to also promote queries that have already been waiting for a long time. By prioritizing short queries, ABM tries to avoid situations where chunks are assigned to queries in a round-robin fashion, as this can have a negative impact on query latency. Besides, a chunk loaded for a short query has a higher chance of being useful to some large query than the other way around. In case ABM does not find a query to schedule a chunk for, it blocks in blockForNextQuery, until a CScan operator wakes it up again using signalQueryAvailable.

Once ABM has found a query to schedule a chunk for, it calls the chooseChunkToLoad routine to select a not-yet-loaded chunk that still needs to be processed by the selected query. The loadRelevance function determines which chunk will actually be loaded, not only by looking at what is relevant to the current query, but also by taking other queries' needs into consideration. To maximize sharing, it promotes chunks that are needed by the highest number of starved queries, while at the same time slightly adjusting priorities to prefer chunks needed by many non-starved queries.

If there are no free slots in the buffer pool, ABM's findFreeSlot routine needs to swap out the chunk with the lowest keepRelevance. This function is similar to loadRelevance, except that when looking at queries, we treat queries on the border of starvation as being starved, to avoid evicting their chunks, which would make them starved, hence schedulable, immediately.

The relevance policy tries to maximize buffer-pool reuse without slowing down fast queries. Thus, a slow query will reuse some of the chunks loaded by a fast query, skipping over chunks that it was too slow to process. These are read again later in the process, when the fast query might already be gone. The access pattern generated by this approach may be (quasi-)random, but since chunks consist of multiple sequential pages, disk (arm) latency is still well amortized.

5. ROW-WISE EXPERIMENTS

5.1 Benchmark settings

Benchmark system: We carried out row-storage experiments on MonetDB/X100, using the PAX storage model, which is equivalent to NSM in terms of I/O demand. The chunk size used was 16MB, and the ABM buffer-pool size was set to 64 chunks (1GB), unless stated otherwise. Direct I/O was used, to avoid operating-system buffering. Our test machine was a dual-CPU AMD Opteron 2GHz system with 4GB of RAM. The storage facility was a 4-way RAID system delivering slightly over 200 MB/s.

Benchmark dataset: We used the standard TPC-H [30] benchmark data with scale factor 10. In this setting the lineitem table consumes over 4GB of disk space. The other tables are fully cached by the system.

Queries: To allow flexible testing of our algorithms we have chosen two queries based on the TPC-H benchmark. Query FAST (F) is TPC-H Q6, which is a simple aggregation. Query SLOW (S) is TPC-H Q1 with extra arithmetic computations to make it more CPU intensive. For all queries we allow arbitrary scaling of the scanned table range. In this section we use the notation QUERY-PERCENTAGE, with QUERY representing the type of query, and PERCENTAGE the size of the range being scanned. For example, with F-10 we denote query FAST, reading 10% of the full relation from a random location. We use multiple query streams, each sequentially executing a random set of queries. There is a 3-second delay between starting the streams, to better simulate queries entering an already-working system.

5.2 Comparing scheduling policies

Table 2 shows the results for all scheduling policies when running 16 streams of 4 queries. We used a mix of slow and fast queries with selectivities of 1%, 10%, 50% and 100%. The two major system-wide results are the average stream running time, which represents the system throughput, and the average normalized latency of a query (the running time in this benchmark divided by the base time, when the query runs by itself with an empty buffer), which represents the system latency. Additionally we provide the total execution time, CPU utilization, and the number of issued I/Os. The difference between the total time and the average stream time comes from the random distribution of queries in the streams, resulting in a significant variance of stream running times. For each query type and policy we provide the average latency, normalized latency and number of I/Os issued when scheduling this query type. Additionally, Figure 4 presents a detailed analysis of the I/O requests issued by each policy.

As expected, the normal policy achieves the worst performance. As Figure 4 shows, it maintains multiple concurrent sequential scans, which leads to the largest number of I/O requests and minimal buffer reuse. Since the query load is relatively CPU-intensive, it still manages to use a significant fraction of the CPU time.

The attach policy allows merging requests from some queries. As a result, it consistently improves the performance of all query types and the system throughput. Still, in Figure 4 we see that there are multiple (albeit fewer than in normal) concurrent scans, since not all queries can share the same chunk sequence. Additionally, we can see that a faster query can detach from a slower one (circled in the figure), resulting in a split of a reading sequence, further reducing the performance.

The elevator policy shows a further reduction of the I/O requests and an improvement of system throughput. This is a result of its simple I/O pattern seen in Figure 4. However, we see that the average normalized latency is very bad for this policy. This is caused by the short queries, which suffer from a long waiting time and achieve results even worse than in normal. This blocking of queries also degrades the overall system time, since it delays the start moment of the next query in a given stream. We also see that fast and slow queries differ little in performance; this is caused by the fast queries waiting for the slow ones.

Our new relevance policy achieves the best performance, both in terms of global system parameters and in most query times.
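To make this control flow concrete, here is a minimal sketch of a single ABM scheduling step under the relevance policy, over a toy state model (each query is a dict with a set of needed chunk ids and a wait counter; all names are illustrative, not the actual implementation):

# Sketch: one ABM scheduling step (middle part of Figure 3).

def starved(q, buffered):
    # A query is starved when fewer than 2 of its chunks are buffered.
    return len(q["needs"] & buffered) < 2

def schedule_one_io(queries, buffered):
    cand = [q for q in queries if starved(q, buffered)]
    if not cand:
        return None  # block until a CScan operator announces new demand
    # queryRelevance: short queries first, aged by waiting time.
    q = max(cand, key=lambda x: -len(x["needs"]) + x["wait"] / len(queries))
    missing = q["needs"] - buffered
    if not missing:
        return None
    # loadRelevance: chunks wanted by many starved queries win,
    # tie-broken on the total number of interested queries.
    def load_rel(c):
        s = sum(1 for o in queries if starved(o, buffered) and c in o["needs"])
        return s * len(queries) + sum(1 for o in queries if c in o["needs"])
    return max(missing, key=load_rel)

qs = [{"needs": {1, 2, 3}, "wait": 0.0}, {"needs": {3, 4}, "wait": 2.0}]
print(schedule_one_io(qs, buffered={2}))  # -> 3: needed by both starved queries

In the real system, the chosen chunk would then be loaded into a slot freed by findFreeSlot, and all blocked queries interested in it would be signalled.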
  • 6.
                            Normal    Attach   Elevator  Relevance
  System statistics
  Avg. stream time          283.72    160.81   138.41      99.55
  Avg. normalized latency     6.42      3.72    13.52       1.96
  Total time                453.06    281.19   244.45     238.16
  CPU use                   53.20%    81.31%   90.20%     93.94%
  I/O requests                4186      2325     1404       1842

  Query statistics (per policy: avg latency in sec, stddev, normalized latency,
  I/Os; per-query I/O counts were not reported for elevator)

        count  standalone  Normal                   Attach                  Elevator           Relevance
               cold time   avg    stddev norm  I/Os avg    stddev norm I/Os avg   stddev norm  avg   stddev norm I/Os
  F-01    9      0.26        1.71   1.02 6.58     2   1.02   0.49 3.92    2   5.31   7.33 20.42   0.52   0.36 2.00    2
  F-10    7      2.06       13.97   5.69 6.78    23   6.23   2.56 3.02   18  15.17   8.63  7.36   3.30   1.30 1.60   18
  F-50    6     10.72      103.59  14.96 9.66    78  58.77  10.96 5.48   67  44.87   7.92  4.19  18.21   6.64 1.70   43
  F-100   9     20.37      192.82  31.56 9.47   153  96.98  23.33 4.76   69  59.60  19.57  2.93  29.01   8.17 1.42   69
  S-01   13      0.38        1.67   1.25 4.39     2   1.19   0.65 3.13    3  15.01  15.04 39.50   0.55   0.29 1.45    2
  S-10    6      3.55       21.58   5.11 6.08    19  15.12   4.08 4.26   24  20.29  23.93  5.72  11.30   5.98 3.18   22
  S-50    6     17.73       78.23  29.07 4.41    95  46.98  16.82 2.65   79  37.39  14.23  2.11  37.77  15.66 2.13   48
  S-100   8     35.27      179.35  59.04 5.09   177 105.51  33.40 2.99   60  79.39  24.37  2.25  98.71  29.89 2.80   44

Table 2: Row-storage experiments (PAX) with a set of FAST and SLOW queries scanning 1%, 10%, 50% and 100% of a table. 16 streams of 4 random queries. All times in seconds.

[Figure 4: Behavior of different scheduling algorithms: disk accesses (chunks) over time for normal, attach, elevator and relevance; the attach panel marks a "detach" event]

As Figure 4 shows, the I/O request pattern of relevance is much more dynamic than in all the other policies. Interestingly, although relevance issues more I/Os than elevator, it still results in a better throughput. This is because the system is mostly CPU-bound in this case, and the extra available I/O time is efficiently used to satisfy further query requirements. Also, relevance differs from the other policies by significantly improving the performance of I/O-bound queries. Average query latency is three times better than normal and two times better than attach (I/O-bound queries like F-100 can even be three times faster than attach).

5.2.1 Exploring Many Different Query Mixes

To provide more than accidental evidence of the superior performance of relevance, we conducted experiments with the same basic settings as in the previous section: 16 streams of 4 queries, a TPC-H table with scale factor 10 and a buffer size of 1GB. However, we changed the set of queries to explore two dimensions: range size and data-processing speed. Figure 5 shows the results, where we compare throughput as the average stream running time (y axis) and the average normalized query latency (x axis) of normal, attach and elevator, relative to our relevance policy.
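For reference, the normalized latency used in Table 2 and on the axes of Figure 5 is simply the observed latency divided by the standalone cold-run time of the query. For example, for F-100 (numbers taken from Table 2):

# Sketch: normalized latency = observed latency / standalone cold run time.
standalone = 20.37  # seconds: F-100 run alone against a cold buffer
observed = {"normal": 192.82, "attach": 96.98,
            "elevator": 59.60, "relevance": 29.01}
print({p: round(t / standalone, 2) for p, t in observed.items()})
# -> {'normal': 9.47, 'attach': 4.76, 'elevator': 2.93, 'relevance': 1.42}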
  • 7. [Figure 5: Performance of various scheduling policies for query sets varying in processing speed and scanned range: average stream time versus average normalized query latency, each relative to relevance]

[Figure 6: Behavior of all scheduling policies under varying buffer pool capacities: number of I/O requests and system time as a function of buffer capacity (percentage of the table size), for a CPU-intensive query set (S-01, S-10, S-50, S-100, F-01, F-10, F-50, F-100) and an I/O-intensive query set (F-01, F-10, F-50, F-100)]

Each point in Figure 5 represents a single run for one policy. The point labels describe runs in a format SPEED-SIZE, where SPEED defines what query speeds were used (FS - a mix of fast and slow, F - only fast, FFS - two times more fast than slow, etc.), and SIZE represents the size of the range being scanned: S - short (a mix of queries reading 1, 2, 5, 10 and 20% of a table), M - mixed (1, 2, 10, 50, 100) and L - long (10, 30, 50, 100).

From this experiment we conclude that indeed relevance, representing the (1,1) point in this scatter plot, is consistently better than all other policies. Recall that our objective was to find a scheduling policy that works well both on the query throughput and on the query latency dimension, and this is exactly what is shown in Figure 5. This scatter plot also allows us to better understand the other policies. We see that normal is inferior on both dimensions, whereas elevator gets close to relevance on throughput, but its performance is significantly hindered by poor query latencies. As for attach, it does find a balance between throughput and latency, but it is consistently beaten by relevance in both dimensions.

5.2.2 Scaling The Data Volume

With growing dataset sizes, the percentage of a relation that can be stored inside the buffer pool decreases. To simulate this, we tested the performance of different policies under varying buffer capacities, ranging from 12.5% to 100% of the full table size. This allows us to observe how different scheduling policies would behave under growing relation sizes, when a smaller fraction of a table can be buffered. In this experiment we used a version of our relation trimmed down to 2GB, which can be fully cached in the memory of our benchmark machine. Figure 6 shows the results of a benchmark with two sets of queries: a disk-intensive one, consisting only of fast queries, and a CPU-intensive one, consisting of a mix of fast and slow queries. We used 8 streams of 4 queries.

As expected, the number of I/Os decreases with increasing buffer size. In the disk-intensive case, the absolute system performance is directly influenced by this number, because the system is never CPU-bound. Still, thanks to better request scheduling, the relevance policy manages to improve the performance, issuing significantly fewer I/O requests even when using an 87.5% buffer capacity. In the CPU-intensive scenario, the number of I/Os influences the absolute time only partially. This is because most algorithms manage to make the system CPU-bound at some buffer capacity. For relevance even a small buffer size of 12.5% of the full table is enough to achieve this, as we can see by its mostly constant performance.

Interestingly, Figure 6 shows that the performance advantages of relevance over the other policies as observed in Figure 5 are maximized when tables get bigger (i.e. at low buffered percentages). When looking at attach, the most viable competitor, we see that throughput in I/O-bound situations, as well as latency in CPU-bound queries, deteriorate strongly, and we expect the advantage of relevance to grow even more if table sizes become huge.

5.2.3 Many Concurrent Queries

With more concurrent queries the opportunities for data reuse increase. In Figure 7 we present how the average query time changes when an increasing number of concurrent queries reads 5, 20 and 50% of our relation, using a buffer pool of 1GB. As expected, the benefit of relevance over normal grows with larger scans and more concurrency. We see that relevance also enhances its advantage over attach when more queries run concurrently, even exceeding the factor two observed in Figures 6 and 5, when scan ranges are very or moderately selective. As this query set is uniform in terms of range sizes, elevator can score close to relevance, but we know from previous experiments that on combinations of short and long ranges it is not a viable competitor.

[Figure 7: Performance comparison with a varying number of concurrent queries (1 to 32) and scanned ranges (5%, 20% and 50% scans): average query latency per policy]

5.2.4 Scheduling-cost scalability

The cost of scheduling in relevance is significantly higher than for the other policies. For example, the loadRelevance function needs to check every query for every table chunk, and do this for each chunk a query requests.
6. IMPROVING DSM SCANS

After our successful experiments with relevance in row storage, we now turn our attention to column stores. The decomposed storage model (DSM) has recently gained popularity for its reduced disk-bandwidth needs, faster query processing thanks to improved data locality [3], the possibility of vectorized processing [5, 4] and additional compression opportunities [33, 1]. While we will show that relevance can also be successful here, we first discuss why DSM is much more complex than NSM when it comes to scans in general and I/O scheduling in particular, and how this influenced our Cooperative Scans framework.

6.1 DSM Challenges

[Figure 9: Compressed column storage: more complex logical chunk – physical page relationships. TPC-H lineitem columns (orderkey, partkey, returnflag, extendedprice, comment) mapped onto disk pages across chunks 0 to 3, with per-column physical widths such as PFOR-DELTA(oid):3bit, PFOR(oid):21bit, PDICT(str):2bit, decimal:64bit and str:256bit]

Table columns stored using DSM may differ from each other in physical data representation width, either because of the data types used, or because of compression. For example, Figure 9 depicts column storage of a part of the TPC-H lineitem table, with some columns compressed using techniques presented in [33]. This shows that we cannot assume a fixed number of tuples on a disk page, even within a single column. As a result, a chunk cannot consist of a fixed number of disk pages, as in NSM. Instead, chunks are logical concepts, i.e., a horizontal partitioning of the table at tuple granularity. For example, one may divide a table into conceptual chunks of 100,000 tuples, but it is also possible to use variable-size chunks, e.g., to make chunk boundaries always match some key boundary. This implies that chunk boundaries do not align with page boundaries. The underlying storage manager should therefore provide an efficient means to tell which pages store data from a given chunk. Depending on the physical data representation, a single logical chunk can consist of multiple physical pages, and a single physical page can contain data for multiple logical chunks.

This logical-physical mismatch present in DSM becomes even more problematic when using large fixed-size physical blocks for I/O, a technique introduced in NSM for good concurrent bandwidth. The first problem is that when loading a block for one chunk, a potentially large amount of data from a neighboring chunk can be loaded at the same time. In NSM this does not occur, as chunks and blocks are equivalent. In DSM, however, ABM needs to take special care to minimize situations in which this extra data is evicted before it can be used by a different chunk. Also, keeping a full physical block in memory to provide it to another query in the near future may result in sub-optimal buffer usage.

The second problem is the buffer demand: in NSM, for Q concurrent queries the system requires 2 ∗ Q blocks (the factor 2 comes from prefetching), which is usually acceptable, e.g., 512MB for 16 concurrent scans using 16MB chunks/blocks. In DSM, however, to process a set of rows, data from multiple columns needs to be delivered. While some optimizations are possible, e.g., performing selections on some columns early [13], in general all the columns used by a query need to be loaded before processing starts. As a result, a separate block is needed for every column used in a query, increasing the buffer demand significantly for multi-column scans, e.g., to 4GB for 16 scans reading 8 columns each. This can be improved (in both models) by analyzing which pages from blocks have already been processed, and reusing them as soon as possible for I/Os of different queries (with scatter-gather I/O). Both problems demonstrate that in DSM the performance and resource demand can be significantly improved by making the algorithms page-aware. However, such solutions significantly complicate the implementation, and our current system handles only chunk-level policies.
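The buffer-demand gap can be verified with a back-of-the-envelope calculation. The sketch below merely reproduces the arithmetic behind the numbers quoted above, assuming 16MB blocks and double-buffering for prefetching; it is an illustration, not code from the system.

# Back-of-the-envelope buffer demand (illustrative arithmetic only).
BLOCK_MB = 16          # assumed I/O block size
PREFETCH_FACTOR = 2    # one block in use + one being prefetched

def nsm_demand_mb(num_queries):
    # NSM: one double-buffered block per concurrent scan.
    return num_queries * PREFETCH_FACTOR * BLOCK_MB

def dsm_demand_mb(num_queries, columns_per_query):
    # DSM: one double-buffered block per column of each concurrent scan.
    return num_queries * columns_per_query * PREFETCH_FACTOR * BLOCK_MB

print(nsm_demand_mb(16))     # 512 (MB), matching the NSM example
print(dsm_demand_mb(16, 8))  # 4096 (MB), i.e. 4GB, matching the DSM example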
The final DSM problem for Cooperative Scans is the reduced data-reuse opportunity between queries. Figure 10 shows two queries reading a subset of a table and their I/O requirements in both NSM and DSM. Comparing the logical data need to the physical data demand in both models, we see that the vertical expansion present in DSM is usually significantly smaller than the horizontal expansion present in NSM. As a result, fewer disk blocks are shared between the queries, reducing the chance of reusing the same block for different scans. In NSM, for a block fetched by some query Q1, the probability that another query Q2, reading T2 tuples, will use it is proportional to T2/TT, where TT is the number of tuples in the entire table. In DSM, we need to take into account both the vertical (as in NSM) and the horizontal overlap, reducing this probability to (T2/TT) ∗ Poverlap(Q1, Q2), where Poverlap(Q1, Q2) is the probability that a column from Q1 is also used in Q2.

[Figure 10: Data overlapping in NSM and DSM: the logical-level data need of two queries over columns a to h, compared with the values and I/O units actually fetched in NSM and DSM, showing the overlap between the queries]
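As a worked example with hypothetical values of our own (not measurements): suppose Q2 reads T2 = 600,000 of TT = 6,000,000 tuples, and in DSM only 2 of the 8 columns read by Q1 are shared with Q2, i.e., Poverlap(Q1, Q2) = 0.25. Then:

P_reuse(NSM) = T2/TT = 600,000 / 6,000,000 = 0.1
P_reuse(DSM) = (T2/TT) ∗ Poverlap(Q1, Q2) = 0.1 ∗ 0.25 = 0.025

so a block fetched by Q1 is, under these assumptions, four times less likely to be reusable by Q2 in DSM than in NSM.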
6.2 Cooperative Scans in DSM

The DSM implementation of the traditional policies is straightforward. In normal, the order of I/Os is strictly determined by the query, and LRU buffering is performed at the (chunk, column) level. DSM attach joins a query to the running query it overlaps most with, where a crude measure of overlap is the number of columns the two queries have in common. A more fine-grained measure would be to obtain average page-per-chunk statistics for the columns of a table, and use these as weights when counting the overlapping columns. Just like in NSM, the DSM elevator policy still enforces a global cursor that sequentially moves through the table; obviously, it only loads the union of all columns needed at this position by the active queries.

The framework for relevance in DSM is similar to that in NSM, with a few crucial differences, caused by the challenges discussed in the previous section:

avoiding data waste – as discussed, with I/O based on large physical blocks, it is possible that a block loaded for one logical chunk contains data useful for neighboring chunks. When the first chunk is freed, this data would be evicted. To avoid that, the choice of the next chunk for a given query is performed before the query blocks waiting for a fully available chunk. The already-loaded part of that chunk is marked as used, which prohibits its eviction.

finding space for a chunk – in DSM it is possible that a subset of the columns in a buffered chunk is not useful for any query. ABM first evicts blocks belonging to such columns. Then, it starts evicting useful chunks, using the keepRelevance function to victimize the least relevant chunk. Note that, unlike in NSM, this eviction process is iterative, since due to the differing physical chunk sizes, possibly multiple chunks need to be freed.¹

¹ If multiple chunks need to be freed, the dependencies between them should be taken into account, something the greedy iterative approach used here misses. Choosing the optimal set of chunks to free is a good example of a knapsack problem surfacing in DSM I/O scheduling.

useRelevance(c, q_trigger)
| cols = queryColumns(q_trigger)
| U = |interestedOverlappingQueries(c, cols)|
| Pu = numberCachedPages(c, cols)
| return Pu / U

loadRelevance(c, q_trigger)
| query_cols = queryColumns(q_trigger)
| queries = overlappingStarvedQueries(c, query_cols)
| cols = columnsUsedInQueries(queries)
| L = |queries|
| Pl = |columnPagesToLoad(c, cols)|
| return L / Pl

keepRelevance(c, q_trigger)
| starved = almostStarvedQueries(c)
| cols = columnsUsedInQueries(starved)
| E = |starved|
| Pe = |cachedColumnPages(c, cols)|
| return E / Pe

Figure 11: DSM Relevance Functions

column-aware relevance functions – Figure 11 shows that the DSM relevance functions need to take into account the two-dimensional nature of column storage and the varying physical chunk sizes. Like in NSM, useRelevance attempts to first use chunks needed by few queries, to make them available for eviction. However, it also analyzes the size of a chunk, to additionally promote chunks occupying more buffer space. The loadRelevance function looks at the number of starved queries that overlap with the triggering query and are interested in a given chunk. It also estimates the cost of loading the chunk by computing the number of cold pages required for all needed columns. The returned score promotes chunks that benefit multiple starved queries and require a small amount of I/O. The DSM keepRelevance function promotes keeping chunks that occupy little space in the buffer pool and are useful for many queries.

column loading order – a final issue in DSM is the order of the columns when loading a chosen chunk. If some queries depend only on a subset of the columns, it may be beneficial to load that subset first. Our current, crude approach is to load column chunks in order of increasing size (in terms of pages), which maximizes the number of "early" available columns, allowing queries to be woken up sooner. Another approach could prioritize columns that satisfy some query's needs faster. Finally, if data is compressed on disk but kept decompressed in the buffer manager (as in SybaseIQ), it might be valuable to load the compressed columns first, so that their decompression is interleaved with the loading of the remaining ones.
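As an illustration of this size-ascending heuristic, the following sketch orders the column parts of a chunk by their page counts before issuing their I/Os; the dictionary-based representation is a hypothetical stand-in for the system's internal structures, not its actual interface.

# Sketch of the size-ascending column loading order (illustrative only).
# Loading small columns first maximizes the number of columns available
# early, so queries needing only those columns can be woken up sooner.

def column_load_order(column_pages):
    # column_pages maps a column name to the number of cold pages its
    # part of the chunk occupies on disk.
    return sorted(column_pages, key=lambda col: column_pages[col])

# Hypothetical chunk whose columns differ strongly in physical size,
# e.g. due to compression (cf. Figure 9).
chunk_cols = {"comment": 120, "returnflag": 1, "extendedprice": 24, "orderkey": 3}
print(column_load_order(chunk_cols))
# ['returnflag', 'orderkey', 'extendedprice', 'comment']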
6.3 DSM Results

Table 3 presents DSM results for an experiment similar to the one presented in Table 2 for NSM/PAX. One difference is that we used a faster "slow" query, since, due to the faster scanning achieved by DSM, the original query made the system completely CPU-bound, making it impossible to demonstrate the performance differences between the policies. Also, we increased the lineitem scale factor from 10 (ca. 60M tuples) to 40 (ca. 240M tuples), to compensate for the lower data-volume demand of DSM. Finally, since our current implementation requires reserved memory for each active chunk, and there are more chunks in DSM (one per column), we had to increase the buffer size from 1GB to 1.5GB to allow the concurrent execution of 16 queries.