SlideShare une entreprise Scribd logo
1  sur  31
Télécharger pour lire hors ligne
Dan Han and Eleni Stroulia
University of Alberta
stroulia@ualberta.ca
http://ssrg.cs.ualberta.ca
107/12/13 Cloud 2013
07/12/13 2Cloud 2013
 The General Research Problem
 The Geospatial Problem Instance
 The Data Set
 HBase data-organization alternatives
 Performance analysis
 Some Lessons Learned
07/12/13 3Cloud 2013
07/12/13 4Cloud 2013
07/12/13 Cloud 2013 5
07/12/13 6Cloud 2013
 Appropriate data models for
 time-series (MESOCA 2012)
 Geospatial (CLOUD 2013)
applications
 In progress:
 spatio-temporal applications
07/12/13 7Cloud 2013
07/12/13 9Cloud 2013
07/12/13 10Cloud 2013
 [1] built a multi-dimensional index layer on top of a one-
dimensional key-value store HBase to perform spatial queries.
 [2] presented a novel key formulation schema, based on R+-tree
for spatial index in HBase.
Focus on row-key design
no discussion about column and version design
07/12/13 11
[1] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, Amr El Abbadi: MD-HBase: A
Scalable Multi-dimensional Data Infrastructure for Location Aware Services. Mobile
Data Management (1) 2011: 7-16
[2] Ya-Ting Hsu, Yi-Chin Pan, Ling-Yin Wei, Wen-Chih Peng, Wang-Chien Lee: Key
Formulation Schemes for Spatial Index in Cloud Data Managements. MDM 2012: 21-26
Cloud 2013
 Two Synthetic Datasets
 Uniform and ZipF distribution
 Based on Bixi dataset, each object includes
▪ station ID,
▪ latitude, longitude, station name, terminal name,
▪ number of docks
▪ number of bikes
 100 Million objects (70GB)
 in a 100km*100km simulated space
07/12/13 12Cloud 2013
 Regular Grid Indexing
 Row key: Grid rowID
 Column: Grid columnID
 Version: counter of Objects
 Value: one object in JSON format
07/12/13 13
Counter
Column ID
RowID
00 01 02 03
00
01
02
03
Cloud 2013
 Tie-based quad-tree Indexing
 Z-value Linearization
 Rowkey: Z-value
 Column: Object ID
 Value: one object in JSON Format
07/12/13 14
Z-Value
Object ID
Z-value
Cloud 2013
 Quad-Tree data model
 More rows with deeper tree
 Z-ordering linearization
(violates data locality)
 In-time construction vs. pre-
construction implies a
tradeoff between query
performance and memory
allocation
 Regular Grid data model
 Very easy to locate a cell by
row id and column id
 Cannot handle large space
and fine-grained grid
because in-memory indexes
are subject to memory
constraints
07/12/13 15
How much unrelated data is examined in a query matters a lot!
Cloud 2013
07/12/13 16
O
bjectAttribute
Columnid-ObjectId
QTId-RowId
A A A
A A A
A A A
B B B
B B B
B B B
C C C
C C C
C C C
D D D
D D D
D D D
00
01
11
10
01 02 03 01 02 03
Space
Cloud 2013
07/12/13 17Cloud 2013
The row key is
the QT Z-value +
the RG row
index.
The row key is
the QT Z-value +
the RG row
index.
The column
name is the RG
column and the
object-ID
The column
name is the RG
column and the
object-ID
The attributes of
the data point
are stored in the
third dimension.
The attributes of
the data point
are stored in the
third dimension.
1. Compute minimum bounding square based on the query
input location and the range
2. Compute the quad-tree tiles that overlap with the
bounding square  Z-codes
3. Compute all the regular-grid cells indexes in these quad-
tree tiles  the secondary index of rows and columns
4. Issue one sub-query for each selected tile of the quad-
tree; process with user-level coprocessors on the HBase
regions
5. Collect the results of the sub-queries at the client-side
07/12/13 18Cloud 2013
07/12/13 20Cloud 2013
07/12/13 21
00 02 04 06
00
02
04
06
Cloud 2013
07/12/13 22
00 02 04 06
00
02
04
06
09-00
09-04
Cloud 2013
1. Estimate the search range (density-based range
estimation)
2. Compute indices of rows and columns (steps 2 and 3 of
Range Query)
3. Issue a scan query to retrieve the relevant data points
4. If fewer than K data points are returned, re-estimate the
search range and repeat steps 2-3
5. Sort the return set in increasing distance from the input
location
07/12/13 23Cloud 2013
 Experiment Environment
 A four-node cluster on virtual machines with Ubuntu on
OpenStack
 Hadoop 1.0.2 (replication factor is 2), HBase 0.94
 HBase Configuration
▪ 5K Caching Size
▪ Block cache is true
▪ ROWCOL bloom filter
 Query processing Implementation
 Native java API
 User-Level Coprocessor Implementation
07/12/13 24Cloud 2013
 The granularity of grid affects query-processing
performance
 Explore the “best” cell configuration of each model
 Quad-tree=>(t= 1)
 RG=>(t=0.1)
 HGrid=>(T=10,t=0.1)
07/12/13 25Cloud 2013
HG:≈10:0.1 fewer sub-queries
more false positives
HG:≈1:0.1 more sub-queries
fewer false positives
HG:≈10:0.01 more rows
HG:≈10:0.1 fewer rows
07/12/13 26
 Given a location and a radius,
 Return the data points, located within a distance
less or equal to the radius from the input location
Cloud 2013
 Given the coordinates of a location,
 Return the K points nearest to the location
07/12/13 27Cloud 2013
07/12/13 28Cloud 2013
07/12/13 29Cloud 2013
 Data Organization
 Short row key and column name
 Better to have one column family and few columns
 Not large amount of data in one row
 Row key design should ease pruning unrelated data
 3rd dimension can store data as well
 Bloom Filter should be configured to prune rows and columns
 Compression can reduce the amount of data transmission
07/12/13 30Cloud 2013
 Query Processing
 Scanned rows for one query should not exceed the scan cache
size, otherwise, split the query into sub-queries.
 “Scan” is better than “Get” for retrieving discontinuous keys, even
though the unrelated data
 “Scan” for small queries, while Coprocessor for large queries
 Better to split one large query into multiple sub-queries than use
one query with row filter mechanism
07/12/13 31Cloud 2013
 Benefits from the good locality of the RG index; suffers from
the poor locality of the z-ordering QT linearization
 Performance could be improved with other linearization techniques
 Can be flexibly configured and extended
 The QT index can be replaced by the hash code of each sub-space
 The granularity in the second stage can be varied from sub-space to
sub-space based on the various densities
 Is more suitable for homogeneously covered and
discontinuous spaces
07/12/13 32Cloud 2013
 A Data Model for spatio-temporal dataset
 Towards a General Systematic Guidance for Column
Families and other NoSQL databases
 To apply the data model into cloud-based
applications and big data analytics system
07/12/13 33Cloud 2013

Contenu connexe

Tendances

Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
[FOSS4G KOREA 2014]Hadoop 상에서 MapReduce를 이용한 Spatial Big Data 집계와 시스템 구축
[FOSS4G KOREA 2014]Hadoop 상에서 MapReduce를 이용한 Spatial Big Data 집계와 시스템 구축[FOSS4G KOREA 2014]Hadoop 상에서 MapReduce를 이용한 Spatial Big Data 집계와 시스템 구축
[FOSS4G KOREA 2014]Hadoop 상에서 MapReduce를 이용한 Spatial Big Data 집계와 시스템 구축Kwang Woo NAM
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2Giovanna Roda
 
Hadoop: Distributed data processing
Hadoop: Distributed data processingHadoop: Distributed data processing
Hadoop: Distributed data processingroyans
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)PyData
 
Introduction to Hadoop part1
Introduction to Hadoop part1Introduction to Hadoop part1
Introduction to Hadoop part1Giovanna Roda
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reducePaladion Networks
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
 
report on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivereport on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivesiddharthboora
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingCloudera, Inc.
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkeldariof
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed DatasetsGabriele Modena
 

Tendances (20)

Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
[FOSS4G KOREA 2014]Hadoop 상에서 MapReduce를 이용한 Spatial Big Data 집계와 시스템 구축
[FOSS4G KOREA 2014]Hadoop 상에서 MapReduce를 이용한 Spatial Big Data 집계와 시스템 구축[FOSS4G KOREA 2014]Hadoop 상에서 MapReduce를 이용한 Spatial Big Data 집계와 시스템 구축
[FOSS4G KOREA 2014]Hadoop 상에서 MapReduce를 이용한 Spatial Big Data 집계와 시스템 구축
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2
 
Hadoop: Distributed data processing
Hadoop: Distributed data processingHadoop: Distributed data processing
Hadoop: Distributed data processing
 
Matlab, Big Data, and HDF Server
Matlab, Big Data, and HDF ServerMatlab, Big Data, and HDF Server
Matlab, Big Data, and HDF Server
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
 
An Introduction to the World of Hadoop
An Introduction to the World of HadoopAn Introduction to the World of Hadoop
An Introduction to the World of Hadoop
 
Introduction to Hadoop part1
Introduction to Hadoop part1Introduction to Hadoop part1
Introduction to Hadoop part1
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
report on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivereport on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hive
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce framework
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
 

Similaire à HGrid A Data Model for Large Geospatial Data Sets in HBase

Information processing architectures
Information processing architecturesInformation processing architectures
Information processing architecturesRaji Gogulapati
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGijiert bestjournal
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbAlexander Decker
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbAlexander Decker
 
Block Sampling: Efficient Accurate Online Aggregation in MapReduce
Block Sampling: Efficient Accurate Online Aggregation in MapReduceBlock Sampling: Efficient Accurate Online Aggregation in MapReduce
Block Sampling: Efficient Accurate Online Aggregation in MapReduceVasia Kalavri
 
CCS334 BIG DATA ANALYTICS Session 3 Distributed models.pptx
CCS334 BIG DATA ANALYTICS Session 3 Distributed models.pptxCCS334 BIG DATA ANALYTICS Session 3 Distributed models.pptx
CCS334 BIG DATA ANALYTICS Session 3 Distributed models.pptxAsst.prof M.Gokilavani
 
A frame work for clustering time evolving data
A frame work for clustering time evolving dataA frame work for clustering time evolving data
A frame work for clustering time evolving dataiaemedu
 
[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)Steve Min
 
Data Intensive Grid Service Model
Data Intensive Grid Service ModelData Intensive Grid Service Model
Data Intensive Grid Service Modelgomathynayagam
 
Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)Sprinkle Data Inc
 
Q4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationQ4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationRob Emanuele
 
Processing Drone data @Scale
Processing Drone data @ScaleProcessing Drone data @Scale
Processing Drone data @ScaleDr Hajji Hicham
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopDatabricks
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Samsung Business USA
 
Assimilating sense into disaster recovery databases and judgement framing pr...
Assimilating sense into disaster recovery databases and  judgement framing pr...Assimilating sense into disaster recovery databases and  judgement framing pr...
Assimilating sense into disaster recovery databases and judgement framing pr...IJECEIAES
 

Similaire à HGrid A Data Model for Large Geospatial Data Sets in HBase (20)

Information processing architectures
Information processing architecturesInformation processing architectures
Information processing architectures
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 
Block Sampling: Efficient Accurate Online Aggregation in MapReduce
Block Sampling: Efficient Accurate Online Aggregation in MapReduceBlock Sampling: Efficient Accurate Online Aggregation in MapReduce
Block Sampling: Efficient Accurate Online Aggregation in MapReduce
 
CCS334 BIG DATA ANALYTICS Session 3 Distributed models.pptx
CCS334 BIG DATA ANALYTICS Session 3 Distributed models.pptxCCS334 BIG DATA ANALYTICS Session 3 Distributed models.pptx
CCS334 BIG DATA ANALYTICS Session 3 Distributed models.pptx
 
A frame work for clustering time evolving data
A frame work for clustering time evolving dataA frame work for clustering time evolving data
A frame work for clustering time evolving data
 
[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)
 
No SQL Cassandra
No SQL CassandraNo SQL Cassandra
No SQL Cassandra
 
Data Intensive Grid Service Model
Data Intensive Grid Service ModelData Intensive Grid Service Model
Data Intensive Grid Service Model
 
lecture3.pdf
lecture3.pdflecture3.pdf
lecture3.pdf
 
50120130406035
5012013040603550120130406035
50120130406035
 
Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)Definitive Guide to Select Right Data Warehouse (2020)
Definitive Guide to Select Right Data Warehouse (2020)
 
Q4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis PresentationQ4 2016 GeoTrellis Presentation
Q4 2016 GeoTrellis Presentation
 
Processing Drone data @Scale
Processing Drone data @ScaleProcessing Drone data @Scale
Processing Drone data @Scale
 
Dremel Paper Review
Dremel Paper ReviewDremel Paper Review
Dremel Paper Review
 
Cosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics WorkshopCosmos DB Real-time Advanced Analytics Workshop
Cosmos DB Real-time Advanced Analytics Workshop
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
 
G0354451
G0354451G0354451
G0354451
 
Assimilating sense into disaster recovery databases and judgement framing pr...
Assimilating sense into disaster recovery databases and  judgement framing pr...Assimilating sense into disaster recovery databases and  judgement framing pr...
Assimilating sense into disaster recovery databases and judgement framing pr...
 

Dernier

IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 

Dernier (20)

IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 

HGrid A Data Model for Large Geospatial Data Sets in HBase

  • 1. Dan Han and Eleni Stroulia University of Alberta stroulia@ualberta.ca http://ssrg.cs.ualberta.ca 107/12/13 Cloud 2013
  • 3.  The General Research Problem  The Geospatial Problem Instance  The Data Set  HBase data-organization alternatives  Performance analysis  Some Lessons Learned 07/12/13 3Cloud 2013
  • 7.  Appropriate data models for  time-series (MESOCA 2012)  Geospatial (CLOUD 2013) applications  In progress:  spatio-temporal applications 07/12/13 7Cloud 2013
  • 10.  [1] built a multi-dimensional index layer on top of a one- dimensional key-value store HBase to perform spatial queries.  [2] presented a novel key formulation schema, based on R+-tree for spatial index in HBase. Focus on row-key design no discussion about column and version design 07/12/13 11 [1] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, Amr El Abbadi: MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. Mobile Data Management (1) 2011: 7-16 [2] Ya-Ting Hsu, Yi-Chin Pan, Ling-Yin Wei, Wen-Chih Peng, Wang-Chien Lee: Key Formulation Schemes for Spatial Index in Cloud Data Managements. MDM 2012: 21-26 Cloud 2013
  • 11.  Two Synthetic Datasets  Uniform and ZipF distribution  Based on Bixi dataset, each object includes ▪ station ID, ▪ latitude, longitude, station name, terminal name, ▪ number of docks ▪ number of bikes  100 Million objects (70GB)  in a 100km*100km simulated space 07/12/13 12Cloud 2013
  • 12.  Regular Grid Indexing  Row key: Grid rowID  Column: Grid columnID  Version: counter of Objects  Value: one object in JSON format 07/12/13 13 Counter Column ID RowID 00 01 02 03 00 01 02 03 Cloud 2013
  • 13.  Tie-based quad-tree Indexing  Z-value Linearization  Rowkey: Z-value  Column: Object ID  Value: one object in JSON Format 07/12/13 14 Z-Value Object ID Z-value Cloud 2013
  • 14.  Quad-Tree data model  More rows with deeper tree  Z-ordering linearization (violates data locality)  In-time construction vs. pre- construction implies a tradeoff between query performance and memory allocation  Regular Grid data model  Very easy to locate a cell by row id and column id  Cannot handle large space and fine-grained grid because in-memory indexes are subject to memory constraints 07/12/13 15 How much unrelated data is examined in a query matters a lot! Cloud 2013
  • 15. 07/12/13 16 O bjectAttribute Columnid-ObjectId QTId-RowId A A A A A A A A A B B B B B B B B B C C C C C C C C C D D D D D D D D D 00 01 11 10 01 02 03 01 02 03 Space Cloud 2013
  • 16. 07/12/13 17Cloud 2013 The row key is the QT Z-value + the RG row index. The row key is the QT Z-value + the RG row index. The column name is the RG column and the object-ID The column name is the RG column and the object-ID The attributes of the data point are stored in the third dimension. The attributes of the data point are stored in the third dimension.
  • 17. 1. Compute minimum bounding square based on the query input location and the range 2. Compute the quad-tree tiles that overlap with the bounding square  Z-codes 3. Compute all the regular-grid cells indexes in these quad- tree tiles  the secondary index of rows and columns 4. Issue one sub-query for each selected tile of the quad- tree; process with user-level coprocessors on the HBase regions 5. Collect the results of the sub-queries at the client-side 07/12/13 18Cloud 2013
  • 19. 07/12/13 21 00 02 04 06 00 02 04 06 Cloud 2013
  • 20. 07/12/13 22 00 02 04 06 00 02 04 06 09-00 09-04 Cloud 2013
  • 21. 1. Estimate the search range (density-based range estimation) 2. Compute indices of rows and columns (steps 2 and 3 of Range Query) 3. Issue a scan query to retrieve the relevant data points 4. If fewer than K data points are returned, re-estimate the search range and repeat steps 2-3 5. Sort the return set in increasing distance from the input location 07/12/13 23Cloud 2013
  • 22.  Experiment Environment  A four-node cluster on virtual machines with Ubuntu on OpenStack  Hadoop 1.0.2 (replication factor is 2), HBase 0.94  HBase Configuration ▪ 5K Caching Size ▪ Block cache is true ▪ ROWCOL bloom filter  Query processing Implementation  Native java API  User-Level Coprocessor Implementation 07/12/13 24Cloud 2013
  • 23.  The granularity of grid affects query-processing performance  Explore the “best” cell configuration of each model  Quad-tree=>(t= 1)  RG=>(t=0.1)  HGrid=>(T=10,t=0.1) 07/12/13 25Cloud 2013 HG:≈10:0.1 fewer sub-queries more false positives HG:≈1:0.1 more sub-queries fewer false positives HG:≈10:0.01 more rows HG:≈10:0.1 fewer rows
  • 24. 07/12/13 26  Given a location and a radius,  Return the data points, located within a distance less or equal to the radius from the input location Cloud 2013
  • 25.  Given the coordinates of a location,  Return the K points nearest to the location 07/12/13 27Cloud 2013
  • 28.  Data Organization  Short row key and column name  Better to have one column family and few columns  Not large amount of data in one row  Row key design should ease pruning unrelated data  3rd dimension can store data as well  Bloom Filter should be configured to prune rows and columns  Compression can reduce the amount of data transmission 07/12/13 30Cloud 2013
  • 29.  Query Processing  Scanned rows for one query should not exceed the scan cache size, otherwise, split the query into sub-queries.  “Scan” is better than “Get” for retrieving discontinuous keys, even though the unrelated data  “Scan” for small queries, while Coprocessor for large queries  Better to split one large query into multiple sub-queries than use one query with row filter mechanism 07/12/13 31Cloud 2013
  • 30.  Benefits from the good locality of the RG index; suffers from the poor locality of the z-ordering QT linearization  Performance could be improved with other linearization techniques  Can be flexibly configured and extended  The QT index can be replaced by the hash code of each sub-space  The granularity in the second stage can be varied from sub-space to sub-space based on the various densities  Is more suitable for homogeneously covered and discontinuous spaces 07/12/13 32Cloud 2013
  • 31.  A Data Model for spatio-temporal dataset  Towards a General Systematic Guidance for Column Families and other NoSQL databases  To apply the data model into cloud-based applications and big data analytics system 07/12/13 33Cloud 2013

Notes de l'éditeur

  1. Cloud Computing, is attracting business owners for the perceived benefits, Such as 1 Elasticity which provides on-demand service based on the fluctuating load -- example 2 Excellent system scalability -- there is no limit to storage 3 Low latency and high availability of service. -- latency is a strange thing to discuss… Given these advantages, some enterprises have been working on migrating the legacy applications to the cloud. Currently, the majority of applications deployed in the cloud include social networking, online shopping, monitoring system. In these applications, the data grows monotonously over time. In order to improve the existing services and discover new knowledge, business owners are committing substantial budgets on Analysis of these large time-series data, ranging from simple descriptive statistics to complex analytics. In this movement, the success adoption requires a new model of storage.
  2. Therefore, to address these challenge in this migration, we aim to develop a systematic method for guiding data organization in NoSQL databases, given the data type, the data size and its usage pattern. We start our investigation with HBase, which is a NoSQL database offering. It is built on top of Hadoop. It provides two frameworks for parallel distributed computation. One is MapReduce, and the other one is Coprocessor. MapReduce is very effective for distributed computation over data stored within its tables, But in many cases, for example, simple additive or aggregating operations like summing, counting, Coprocessor can give a dramatic performance improvement over HBase’s already good scanning performance. Therefore, we used Coprocessor implementation in our experiment.
  3. Both of them investigate how to access the multidimensional data efficient with spatial indices, which is a part of problem that we are addressing in this paper. Their methods demonstrate efficient performance with the spatial indices. However, both only focus on the row key design in HBase for data organization, there is little discussion about the column and version design. To work out an appropriate data model for geospatial datasets which can be easily and directly applied to the location-based applications, in addition of the row key, we also take column and version into account to model the data in HBase. Furthermore, we implemented the queries with HBase Coprocessor to harness the parallelism benefits, while the above studies processed the queries with HBase Scan.
  4. It relies on a regular-grid index. The row key is the row index of the cell in the grid, the column is the column index of the cell, Version: counter of objects. So we can see the third dimension holds a stack of data points located in the same grid cell. Value: each storage cell represents one object in JSON format holding all other attributes and values
  5. It relies on a trie-based quad-tree index. Applies Z-ordering to transform the two-dimensional spatial data into an one-dimensional array. In this model, the row key, is Z-value. The column is the object id which locates in the cell. Usually people encode the cells with binary digits. But as the row key should be short in HBase, we use decimal encoding here.
  6. QT If the index is built in real time for each query, the construction cost dominates many small queries. If the index is maintained in memory, the granularity of the grid is limited by the amount of memory available, since the memory needed to maintain the index increases as the depth of the tree increases and the size of the grid cells becomes smaller. RG The third dimension holds a stack of data points located in the same grid cell, and an index is maintained to keep the count of objects in each cell stack in order to support updates.
  7. 1) the data-set space is divided into equally-sized rectangular tiles T, encoded with their Z-value. 2) the data points are organized in a regular grid of continuous uniform fine-grained cells. In this model, each data point is uniquely identified in terms of its row key and column name.
  8. The row key is the concatenation of the quad-tree Z-value and the Regular Grid row index. The column name is the concatenation of the Regular Grid column and the object id of the data point. The attributes of the data point are stored in the third dimension.
  9. Range queries are commonly used in location based applications. Given the coordinates of a location and a radius, a vector of data points, located within a distance less or equal to the radius from the input location is returned. Relying on the HGrid data model, to answer this query the following process is followed:
  10. In this example, four tiles are involved. If the query is computed in one call, the scan range is from 03 to 09, and it will include the irrelevant tiles “04,05,07,08,10,11”. Therefore, instead of triggering one query, we split them into four sub-queries, and each sub-query corresponds to one tile. The sub-queries are called in sequence, and in each sub-query, the query is executed by coprocessor in parallel.
  11. In this example, four tiles are involved. If the query is computed in one call, the scan range is from 03 to 09, and it will include the irrelevant tiles “04,05,07,08,10,11”. Therefore, instead of triggering one query, we split them into four sub-queries, and each sub-query corresponds to one tile. The sub-queries are called in sequence, and in each sub-query, the query is executed by coprocessor in parallel.
  12. In this example, four tiles are involved. If the query is computed in one call, the scan range is from 03 to 09, and it will include the irrelevant tiles “04,05,07,08,10,11”. Therefore, instead of triggering one query, we split them into four sub-queries, and each sub-query corresponds to one tile. The sub-queries are called in sequence, and in each sub-query, the query is executed by coprocessor in parallel.
  13. In this example, four tiles are involved. If the query is computed in one call, the scan range is from 03 to 09, and it will include the irrelevant tiles “04,05,07,08,10,11”. Therefore, instead of triggering one query, we split them into four sub-queries, and each sub-query corresponds to one tile. The sub-queries are called in sequence, and in each sub-query, the query is executed by coprocessor in parallel. But still within a coprocessor tile, irrelevant regular-grid cells will be seen, so we filter them out
  14. Have not got a better way to describe this algorithm
  15. Our experiments were performed on a four-node cluster, running on four virtual machines on an Openstack Cloud. The virtual machines run 64bit Ubuntu 11.10 and have 2 cores, 4GB of RAM, and a 200 GB disk. We used Hadoop version 1.0.2, and HBase version 0.94. Hadoop and HBase were each given 2GB of Heap size in every running node. Configurations: 1) HDFS was configured with a replication factor of 2. 2) The data was compressed with gzip. 3) the ROWCOL filter was applied on each table. 4) The scan cache size was set to 5K and the block cache was set to true, for the query processing. We implemented the Range Query processing with the Coprocessor framework and KNN with Scan in order to examine the implications of these different implementations to performance.
  16. Across these three data models, a common configuration parameter is the size of each cell, which determines the granularity of grid. This is a very important variable which substantially affects the query-processing performance. Therefore, before we compare the three data models against each other, we have explored the best cell configuration for each model. We varied the size of the cell to observe how the different sizes of cell affect the performance of each data model. Result: In our experiments, for the QT data model, the appropriate cell configuration was 1, while for the RG data model the acceptable cell size was 0.1.
  17. We evaluated range query performance under three data models with both uniform and Zipf distribution data. The table shows the query response time of the three data models for various ranges when the system contains 100 million objects. As the radius increases, the size of irrelevant data vs the return-set size ration increases, the running time also increases because more data points are retrieved. Comparing the three models, we can see that the regular-grid data model outperforms the others. Because it supports better data locality, it demonstrates better performance since the percentage of irrelevant rows scanned is low. The HGrid data model is much better than quadtree data model and worse than the regular-grid data model. The same performance trends persist with both uniform and skewed data.
  18. We now evaluate the performance for k Nearest Neighbor (kNN) queries using the same data set, under the three data models. This table shows the response time (in seconds) for kNN queries, where k takes the values 1, 10, 100, 1,000, and 10,000. As the density-based range estimation method is employed , there is only one scan operation in the query processing for uniform data, while for skewed data, more than one scan iterations are invoked to retrieve the data. That is why the performance with skewed data under all data models is a little worse than that with the uniform data set. For both uniform and skewed data, the Regular-grid data model demonstrates best performance among the three data models; the HGrid data model come second with slightly worse performance than the regular-grid data model; and the quadtree data model is outperformed by the other two. The poor locality preservation, due to the Z-order linearization method, contributes to the poor performance of the quadtree data model, and also impacts the performance of HGrid, albeit less strongly. For skewed data, with too many false positives, the query with the data points having more than 70% probability cannot get the result below the timeout threshold under all data models when k equals to 10K. To improve performance, a finer granularity is required to filter irrelevant data scanning.
  19. In summary, the query performance of the HGrid data model is better than the quadtree data model and worse than the RG data model. Benefits the good locality of the second-tier regular-grid index, while at the same time suffering from the poor locality of the Zordering linearization at the first tier. Better performance can potentially be obtained with alternative linearization techniques. For skewed data, the HGrid behaves better with an appropriate configuration, while the regular-grid and QT data models are subject to memory constraints. Can be flexibly configured and extended 1) The quad-tree index can be replaced by the hash code of each sub-space 2) The point-based quad-tree index method is employed. 3) The granularity in the second stage can be varied from sub-space to sub-space based on the various densities. Therefore, HGrid is more scalable and suitable for both homogeneously covered and discontinuous spaces.
  20. The row key and column name should be short, since they are stored with every cell in the row. It is better to have one column family, only introducing more column families in the case where data access is usually column scoped. The number of columns should be limited. A number in the hundreds is likely to lead to good performance. The amount of data in one row should be kept relatively small. The cost (in time) of retrieving a row has n data increases more than twice with n [12]. The row key should be designed to support pruning of unrelated data easily. When the third dimension is used for storing other information rather than time-to-live values, it is preferable to keep it shallow, and be limited to containing up to no more than hundreds of data points, as deep stacks lead to poor insertion performance. The Bloom Filter [10] should be configured as it can accelerate the performance by pruning the data from both row and column sides. Compression can improve the performance by reducing the amount of data transmission.
  21. Scan operations are preferable to Get operations for retrieving discontinuous keys, even though the Scan result is bound to also include data points that are not part of the response data set. It is more efficient to Get one row with n data points than n rows with one data point each [12]. It is advisable to narrow the range of queried columns with the Filter mechanism. The number of rows to be scanned for a query should not exceed the scan cache size, which depends on client and server memory. Otherwise, it is better to split the query into several sub-queries. When there are too many unrelated rows within the defined scan range, splitting one query into multiple sub-queries with multiple Scan operations is more efficient than one query with Filter mechanism to retrieve rows one by one. The Scan operation is preferable for small queries, while Coprocessor should be used for large queries.
  22. Can be flexibly configured and extended 1) The quad-tree index can be replaced by the hash code of each sub-space 2) The point-based quad-tree index method is employed. 3) The granularity in the second stage can be varied from sub-space to sub-space based on the various densities. Therefore, HGrid is more scalable and suitable for both homogeneously covered and discontinuous spaces.