3. The General Research Problem
The Geospatial Problem Instance
The Data Set
HBase data-organization alternatives
Performance analysis
Some Lessons Learned
07/12/13 Cloud 2013
10. [1] built a multi-dimensional index layer on top of the one-dimensional key-value store HBase to perform spatial queries.
[2] presented a novel key-formulation scheme, based on the R+-tree, for spatial indexing in HBase.
Both focus on row-key design, with no discussion of column and version design
[1] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, Amr El Abbadi: MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. Mobile Data Management (1) 2011: 7-16
[2] Ya-Ting Hsu, Yi-Chin Pan, Ling-Yin Wei, Wen-Chih Peng, Wang-Chien Lee: Key Formulation Schemes for Spatial Index in Cloud Data Managements. MDM 2012: 21-26
11. Two Synthetic Datasets
Uniform and Zipf distributions
Based on the Bixi dataset; each object includes
▪ station ID,
▪ latitude, longitude, station name, terminal name,
▪ number of docks,
▪ number of bikes
100 million objects (70 GB)
in a 100 km × 100 km simulated space
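The generation of these objects can be sketched as follows. This is a minimal Python illustration (the actual data generator is not described on the slide); the field names, `make_object`, and `uniform_points` are assumptions modeled on the Bixi-style attribute list above:

```python
import json
import random

SPACE_KM = 100.0  # 100 km x 100 km simulated space

def make_object(obj_id, x_km, y_km):
    """One Bixi-style object carrying the attributes listed above."""
    return {
        "station_id": obj_id,
        "longitude": x_km,   # simulated planar coordinates, in km
        "latitude": y_km,
        "station_name": f"station-{obj_id}",   # illustrative values
        "terminal_name": f"terminal-{obj_id}",
        "num_docks": random.randint(10, 40),
        "num_bikes": random.randint(0, 40),
    }

def uniform_points(n):
    """Objects spread uniformly over the simulated space."""
    for i in range(n):
        yield make_object(i, random.uniform(0, SPACE_KM),
                          random.uniform(0, SPACE_KM))

# Each storage cell holds one object serialized as JSON:
print(json.dumps(next(uniform_points(1))))
```

A Zipf-distributed variant would draw the coordinates from a skewed distribution instead of `uniform`, clustering objects in a few hot regions.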
12. Regular Grid Indexing
Row key: grid row ID
Column: grid column ID
Version: counter of objects
Value: one object in JSON format
[Figure: a 4×4 regular grid with row IDs and column IDs 00-03; each cell stacks its objects along the counter (version) dimension.]
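The cell addressing of this model can be sketched as follows: a minimal Python sketch with a plain dict standing in for the HBase table, where `rg_cell` and `rg_put` are hypothetical helper names and the two-digit zero-padded IDs mirror the 00-03 labels of the grid figure:

```python
def rg_cell(x_km, y_km, cell_size_km=0.1):
    """Map a point to its regular-grid cell: (row key, column name)."""
    row = int(y_km // cell_size_km)
    col = int(x_km // cell_size_km)
    return f"{row:02d}", f"{col:02d}"

def rg_put(table, x_km, y_km, json_value, cell_size_km=0.1):
    """Stack the object in its cell; the version counter is the stack depth.
    `table` is a dict standing in for the HBase table:
    table[row_key][column] -> list of JSON values (one per version)."""
    row, col = rg_cell(x_km, y_km, cell_size_km)
    versions = table.setdefault(row, {}).setdefault(col, [])
    versions.append(json_value)
    return row, col, len(versions)
```

Co-located points land in the same (row, column) address, so the third dimension holds a stack of the data points in that grid cell.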
13. Trie-based Quad-tree Indexing
Z-value linearization
Row key: Z-value
Column: object ID
Value: one object in JSON format
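The Z-value linearization can be sketched by bit interleaving (Morton coding). `z_value` is an illustrative helper name, and the decimal row key follows the speaker note that binary-encoded keys would be too long for HBase:

```python
def z_value(row, col, depth):
    """Interleave the bits of (row, col) into a single Z-value,
    linearizing the 2-D quad-tree tile address into a 1-D key."""
    z = 0
    for i in range(depth):
        z |= ((col >> i) & 1) << (2 * i)      # column bit -> even position
        z |= ((row >> i) & 1) << (2 * i + 1)  # row bit    -> odd position
    return z

def z_rowkey(row, col, depth):
    """Decimal string form of the Z-value, kept short for the row key."""
    return str(z_value(row, col, depth))
```

For example, the four tiles of a depth-1 quad-tree get Z-values 0 to 3, and neighbors in Z-order are not always neighbors in space, which is the locality problem noted on the next slide.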
14. Quad-tree data model
More rows with a deeper tree
Z-ordering linearization (violates data locality)
In-time construction vs. pre-construction implies a tradeoff between query performance and memory allocation
Regular Grid data model
Very easy to locate a cell by row ID and column ID
Cannot handle a large space with a fine-grained grid, because the in-memory indexes are subject to memory constraints
How much unrelated data is examined in a query matters a lot!
16. The HGrid data model
The row key is the QT Z-value + the RG row index.
The column name is the RG column + the object ID.
The attributes of the data point are stored in the third dimension.
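The two-level HGrid addressing can be sketched as follows; the `-` separator and the zero padding are illustrative assumptions, since the slides do not give the exact byte layout of the keys:

```python
def hgrid_keys(z_value, rg_row, rg_col, object_id):
    """Compose the two-level HGrid address described above:
    row key = quad-tree Z-value + regular-grid row index,
    column  = regular-grid column index + object ID.
    The '-' separator and zero padding are illustrative assumptions."""
    row_key = f"{z_value}-{rg_row:02d}"
    column = f"{rg_col:02d}-{object_id}"
    return row_key, column
```

Because the Z-value leads the row key, all rows of one quad-tree tile are contiguous, so a range query can scan one tile at a time; within the tile, the regular-grid row and column give fine-grained addressing.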
17. 1. Compute the minimum bounding square based on the query input location and the range
2. Compute the quad-tree tiles (Z-codes) that overlap with the bounding square
3. Compute the regular-grid cell indexes (the secondary index of rows and columns) in these quad-tree tiles
4. Issue one sub-query for each selected quad-tree tile; process with user-level coprocessors on the HBase regions
5. Collect the results of the sub-queries at the client side
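Steps 1, 2, and 4-5 of this process can be sketched as follows. In this Python sketch, `scan_tile` is a hypothetical stand-in for the coprocessor-backed per-tile sub-query, and `overlapping_tiles` simplifies the quad-tree decomposition to equally sized tiles at a single depth:

```python
def bounding_square(cx, cy, radius):
    """Step 1: minimum bounding square of the circular query range."""
    return cx - radius, cy - radius, cx + radius, cy + radius

def overlapping_tiles(x0, y0, x1, y1, tile_size):
    """Step 2: the (row, col) addresses of the tiles overlapping
    the bounding square."""
    return [(r, c)
            for r in range(int(y0 // tile_size), int(y1 // tile_size) + 1)
            for c in range(int(x0 // tile_size), int(x1 // tile_size) + 1)]

def range_query(cx, cy, radius, scan_tile, tile_size=10.0):
    """Steps 4-5: one sub-query per tile, results collected client-side.
    `scan_tile(tile, cx, cy, radius)` is a hypothetical stand-in for
    the coprocessor-backed HBase sub-query."""
    results = []
    x0, y0, x1, y1 = bounding_square(cx, cy, radius)
    for tile in overlapping_tiles(x0, y0, x1, y1, tile_size):
        results.extend(scan_tile(tile, cx, cy, radius))
    return results
```

Issuing one sub-query per tile keeps each scan within a contiguous key range, which avoids sweeping across the irrelevant Z-values that lie between the tiles.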
21. 1. Estimate the search range (density-based range estimation)
2. Compute the indexes of rows and columns (steps 2 and 3 of the range query)
3. Issue a scan query to retrieve the relevant data points
4. If fewer than K data points are returned, re-estimate the search range and repeat steps 2-3
5. Sort the result set by increasing distance from the input location
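The retry loop above can be sketched as follows; `range_scan` stands in for steps 2-3 (index computation plus the HBase scan), and the fixed `grow` factor is an assumed stand-in for the paper's density-based re-estimation:

```python
import math

def knn(k, qx, qy, range_scan, initial_radius, grow=2.0):
    """Estimate a search range, scan, and re-estimate until at least
    k points come back, then sort by distance to the query point.
    `range_scan(x, y, r)` stands in for steps 2-3; `grow` is an
    assumed expansion factor, not the paper's density-based estimator."""
    radius = initial_radius
    points = range_scan(qx, qy, radius)
    while len(points) < k:
        radius *= grow  # too few points: enlarge the range and rescan
        points = range_scan(qx, qy, radius)
    points.sort(key=lambda p: math.hypot(p[0] - qx, p[1] - qy))
    return points[:k]
```

A good initial estimate matters: uniform data needs only one scan, while skewed data may trigger several iterations, which is why kNN is slower on the Zipf data set in the experiments below.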
22. Experiment Environment
A four-node cluster of virtual machines running Ubuntu on OpenStack
Hadoop 1.0.2 (replication factor 2), HBase 0.94
HBase configuration
▪ Scan cache size: 5K
▪ Block cache enabled
▪ ROWCOL Bloom filter
Query-processing implementation
Native Java API
User-level Coprocessor implementation
23. The granularity of the grid affects query-processing performance
Explore the "best" cell configuration of each model
Quad-tree => (t = 1)
RG => (t = 0.1)
HGrid => (T = 10, t = 0.1)
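Assuming T and t are tile/cell edge lengths in km over the 100 km × 100 km space (the slides do not state the units explicitly), these configurations imply grid sizes as follows:

```python
SPACE_KM = 100.0  # edge length of the simulated space

def cell_count(cell_km):
    """Cells per axis and total cells for a given cell edge length."""
    per_axis = round(SPACE_KM / cell_km)
    return per_axis, per_axis * per_axis

# Quad-tree (t=1):        100 x 100   =    10,000 tiles
# Regular grid (t=0.1): 1,000 x 1,000 = 1,000,000 cells
# HGrid (T=10, t=0.1):    10 x 10 tiles, each split into 100 x 100 sub-cells
print(cell_count(1.0), cell_count(0.1), cell_count(10.0))
```

This illustrates the tradeoff: a finer grid prunes more irrelevant data per query, but the regular-grid and quad-tree models must keep correspondingly larger indexes in memory, whereas HGrid keeps the first tier coarse.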
HGrid (T=10, t=0.1): fewer sub-queries, more false positives
HGrid (T=1, t=0.1): more sub-queries, fewer false positives
HGrid (T=10, t=0.01): more rows
HGrid (T=10, t=0.1): fewer rows
24. Given a location and a radius,
return the data points located within a distance less than or equal to the radius from the input location
25. Given the coordinates of a location,
return the K points nearest to that location
28. Data Organization
Keep row keys and column names short
Better to have one column family and few columns
Avoid large amounts of data in one row
The row-key design should ease pruning of unrelated data
The third dimension can store data as well
The Bloom filter should be configured to prune rows and columns
Compression can reduce the amount of data transmitted
29. Query Processing
The rows scanned for one query should not exceed the scan cache size; otherwise, split the query into sub-queries
"Scan" is better than "Get" for retrieving discontinuous keys, even though the result includes unrelated data
"Scan" for small queries, Coprocessor for large queries
Better to split one large query into multiple sub-queries than to use one query with the row-filter mechanism
30. Benefits from the good locality of the RG index; suffers from the poor locality of the Z-ordering QT linearization
Performance could be improved with other linearization techniques
Can be flexibly configured and extended
The QT index can be replaced by the hash code of each sub-space
The granularity in the second stage can be varied from sub-space to sub-space based on the varying densities
More suitable for homogeneously covered and discontinuous spaces
31. A data model for spatio-temporal datasets
Towards general, systematic guidance for column-family and other NoSQL databases
Apply the data model to cloud-based applications and big-data analytics systems
Speaker notes
Cloud computing is attracting business owners with perceived benefits such as: 1) elasticity, which provides on-demand service based on the fluctuating load; 2) excellent system scalability, with no practical limit to storage; 3) low latency and high availability of service. Given these advantages, some enterprises have been working on migrating their legacy applications to the cloud. Currently, the majority of applications deployed in the cloud include social networking, online shopping, and monitoring systems. In these applications, the data grows monotonically over time. In order to improve existing services and discover new knowledge, business owners are committing substantial budgets to the analysis of these large time-series data, ranging from simple descriptive statistics to complex analytics. In this movement, successful adoption requires a new model of storage.
Therefore, to address these challenges in this migration, we aim to develop a systematic method for guiding data organization in NoSQL databases, given the data type, the data size, and its usage pattern. We start our investigation with HBase, a NoSQL database offering built on top of Hadoop. It provides two frameworks for parallel distributed computation: MapReduce and the Coprocessor. MapReduce is very effective for distributed computation over data stored within HBase tables, but in many cases, for example simple additive or aggregating operations like summing and counting, the Coprocessor can give a dramatic performance improvement over HBase's already good scanning performance. Therefore, we used a Coprocessor implementation in our experiments.
Both of these investigate how to access multidimensional data efficiently with spatial indices, which is part of the problem we address in this paper. Their methods demonstrate efficient performance with spatial indices. However, both focus only on row-key design in HBase for data organization; there is little discussion of column and version design. To work out an appropriate data model for geospatial datasets that can be easily and directly applied to location-based applications, we take columns and versions into account, in addition to the row key, when modeling the data in HBase. Furthermore, we implemented the queries with the HBase Coprocessor to harness the benefits of parallelism, while the above studies processed queries with HBase Scan.
It relies on a regular-grid index. The row key is the row index of the cell in the grid, the column is the column index of the cell, and the version is a counter of objects, so the third dimension holds a stack of the data points located in the same grid cell. The value of each storage cell is one object in JSON format, holding all other attributes and values.
It relies on a trie-based quad-tree index and applies Z-ordering to transform the two-dimensional spatial data into a one-dimensional array. In this model, the row key is the Z-value, and the column is the ID of the object located in the cell. Usually the cells are encoded with binary digits, but as row keys should be short in HBase, we use a decimal encoding here.
QT: If the index is built in real time for each query, the construction cost dominates many small queries. If the index is maintained in memory, the granularity of the grid is limited by the amount of memory available, since the memory needed to maintain the index increases as the depth of the tree increases and the size of the grid cells becomes smaller. RG: The third dimension holds a stack of the data points located in the same grid cell, and an index is maintained to keep the count of objects in each cell stack in order to support updates.
1) the data-set space is divided into equally-sized rectangular tiles T, encoded with their Z-value. 2) the data points are organized in a regular grid of continuous uniform fine-grained cells. In this model, each data point is uniquely identified in terms of its row key and column name.
The row key is the concatenation of the quad-tree Z-value and the Regular Grid row index. The column name is the concatenation of the Regular Grid column and the object id of the data point. The attributes of the data point are stored in the third dimension.
Range queries are commonly used in location-based applications. Given the coordinates of a location and a radius, a vector of the data points located within a distance less than or equal to the radius from the input location is returned. Relying on the HGrid data model, this query is answered by the following process:
In this example, four tiles are involved. If the query were computed in one call, the scan range would run from 03 to 09, and it would include the irrelevant tiles 04, 05, 07, 08, 10, and 11. Therefore, instead of triggering one query, we split it into four sub-queries, each corresponding to one tile. The sub-queries are called in sequence, and within each sub-query the work is executed by the coprocessor in parallel. Even within a coprocessor tile, irrelevant regular-grid cells will be seen, so we filter them out.
Our experiments were performed on a four-node cluster running on four virtual machines on an OpenStack cloud. The virtual machines run 64-bit Ubuntu 11.10 and have 2 cores, 4 GB of RAM, and a 200 GB disk. We used Hadoop version 1.0.2 and HBase version 0.94; each was given 2 GB of heap on every running node. Configuration: 1) HDFS was configured with a replication factor of 2. 2) The data was compressed with gzip. 3) The ROWCOL Bloom filter was applied on each table. 4) For query processing, the scan cache size was set to 5K and the block cache was enabled. We implemented the range-query processing with the Coprocessor framework and kNN with Scan, in order to examine the performance implications of these different implementations.
Across these three data models, a common configuration parameter is the size of each cell, which determines the granularity of grid. This is a very important variable which substantially affects the query-processing performance. Therefore, before we compare the three data models against each other, we have explored the best cell configuration for each model. We varied the size of the cell to observe how the different sizes of cell affect the performance of each data model. Result: In our experiments, for the QT data model, the appropriate cell configuration was 1, while for the RG data model the acceptable cell size was 0.1.
We evaluated range-query performance under the three data models with both uniformly and Zipf-distributed data. The table shows the query response time of the three data models for various ranges when the system contains 100 million objects. As the radius increases, the ratio of irrelevant data to the return-set size increases, and the running time also increases because more data points are retrieved. Comparing the three models, the regular-grid data model outperforms the others: because it supports better data locality, the percentage of irrelevant rows scanned is low. The HGrid data model is much better than the quad-tree data model and worse than the regular-grid data model. The same performance trends persist with both uniform and skewed data.
We now evaluate the performance of k-nearest-neighbor (kNN) queries using the same data set under the three data models. This table shows the response time (in seconds) for kNN queries, where k takes the values 1, 10, 100, 1,000, and 10,000. As the density-based range-estimation method is employed, only one scan operation is needed in the query processing for uniform data, while for skewed data more than one scan iteration is invoked to retrieve the data. That is why the performance with skewed data is a little worse than with the uniform data set under all data models. For both uniform and skewed data, the regular-grid data model demonstrates the best performance among the three; the HGrid data model comes second, with slightly worse performance than the regular-grid data model; and the quad-tree data model is outperformed by the other two. The poor locality preservation due to the Z-order linearization method contributes to the poor performance of the quad-tree data model, and also impacts the performance of HGrid, albeit less strongly. For skewed data, with too many false positives, queries over data points having more than 70% probability cannot complete below the timeout threshold under any data model when k equals 10K. To improve performance, a finer granularity is required to filter irrelevant data from the scan.
In summary, the query performance of the HGrid data model is better than the quad-tree data model and worse than the RG data model. It benefits from the good locality of the second-tier regular-grid index, while at the same time suffering from the poor locality of the Z-ordering linearization at the first tier; better performance can potentially be obtained with alternative linearization techniques. For skewed data, HGrid behaves better with an appropriate configuration, while the regular-grid and QT data models are subject to memory constraints. HGrid can be flexibly configured and extended: 1) the quad-tree index can be replaced by the hash code of each sub-space; 2) the point-based quad-tree index method can be employed; 3) the granularity in the second stage can be varied from sub-space to sub-space based on the varying densities. Therefore, HGrid is more scalable and suitable for both homogeneously covered and discontinuous spaces.
The row key and column name should be short, since they are stored with every cell in the row. It is better to have one column family, introducing more column families only where data access is usually column-scoped. The number of columns should be limited; a number in the hundreds is likely to lead to good performance. The amount of data in one row should be kept relatively small: the cost (in time) of retrieving a row with n data points increases more than twofold with n [12]. The row key should be designed to support easy pruning of unrelated data. When the third dimension is used for storing information other than time-to-live values, it is preferable to keep it shallow, limited to no more than hundreds of data points, as deep stacks lead to poor insertion performance. The Bloom filter [10] should be configured, as it can accelerate performance by pruning data on both the row and column sides. Compression can improve performance by reducing the amount of data transmitted.
Scan operations are preferable to Get operations for retrieving discontinuous keys, even though the Scan result is bound to also include data points that are not part of the response set. It is more efficient to Get one row with n data points than n rows with one data point each [12]. It is advisable to narrow the range of queried columns with the Filter mechanism. The number of rows to be scanned for a query should not exceed the scan cache size, which depends on client and server memory; otherwise, it is better to split the query into several sub-queries. When there are too many unrelated rows within the defined scan range, splitting one query into multiple sub-queries with multiple Scan operations is more efficient than one query using the Filter mechanism to retrieve rows one by one. The Scan operation is preferable for small queries, while the Coprocessor should be used for large queries.