A 3-Dimensional Data Model in HBase for Large Time-Series Datasets
1. Dan Han, Eleni Stroulia
University of Alberta
9/20/2012
MESOCA 2012 1
2. Outline
» Background and Motivation
» Related Work
» A 3-Dimensional Data Model in HBase
» Case Study and Experiment Results
» Discussion
» Conclusions and Future Work
3. Migrating Applications to the Cloud
» Cloud is an attractive computing platform
˃ Elasticity, excellent scalability, high availability, low operating cost
» Applications are moving to the cloud
˃ Social networking, online shopping, monitoring system
˃ Time-series data: grows monotonically over time
˃ Analysis of large scale time-series data
+ May lead to new knowledge
+ May lead to improvements of existing services
» Successful adoption of this paradigm requires a new model of storage
4. Migrating RDBMS Content to NoSQL
» From RDBMS to NoSQL storage systems
˃ Enable the storage of big data, sorted by row key
˃ Scale horizontally across storage nodes easily
˃ Not much data-organization support
» Migration challenges
˃ Little experience and few principles to follow
˃ Steep learning curve for programming
˃ Much experimentation is required before deployment
+ Much time is spent in designing the data schema
+ The “wrong” schema may lead to inefficient, high-latency queries
5. We Need Design Patterns for HBase Schemas
» Our objective is to develop a systematic method for guiding data organization in NoSQL databases, given
˃ the types of data stored
˃ the amount of data
˃ the data-usage patterns
» We start our investigation with HBase
˃ A NoSQL database offering, built on top of Hadoop
˃ Parallel Distributed Computation
+ MapReduce Framework
+ Coprocessor Framework
6. Related Work
» Talks at HBaseCon 2012, held in May
˃ Data schemas and Coprocessors were the two main topics
˃ Experience from 30 enterprises, e.g., Facebook, Yapmap, eBay, Adobe
» Organizing time-series data in period-specific “buckets”
˃ OpenTSDB: a distributed, scalable time-series database on top of HBase
˃ A data model in Cassandra, another NoSQL database offering
˃ Applied in our case study
7. Data Organization in HBase
» Cell in HBase
˃ (Row, Family:Column, Version) => (X, Y, Z) = value
[Figure: a 2-D layout (axes X, Y) versus a 3-D layout (axes X, Y, Z)]
Schema    Row                      Family:Column         Version
2-D       unique id - timestamp    varying properties    current timestamp
3-D       unique id                varying properties    timestamps
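The difference between the 2-D and 3-D layouts can be made concrete with a toy nested map. This is a minimal sketch, not HBase's client API: the class `CellStore`, the row id `p42`, and the column `props:mass` are hypothetical names chosen for illustration.

```python
class CellStore:
    """Toy stand-in for an HBase table: row -> column -> version -> value."""

    def __init__(self):
        self._rows = {}

    def put(self, row, column, version, value):
        self._rows.setdefault(row, {}).setdefault(column, {})[version] = value

    def get(self, row, column, version=None):
        """Return the value at `version`, or at the highest version if omitted."""
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None
        if version is None:
            version = max(versions)
        return versions.get(version)

    def versions(self, row, column):
        return sorted(self._rows.get(row, {}).get(column, {}))

    def row_count(self):
        return len(self._rows)


# 2-D layout: the timestamp is folded into the row key, so every
# (id, timestamp) pair becomes its own row. The version would be the
# write time; 0 is used here as a placeholder (assumption).
flat = CellStore()
flat.put("p42-1", "props:mass", 0, 1.5)
flat.put("p42-2", "props:mass", 0, 1.7)

# 3-D layout: one row per id; the data timestamp is the version (Z).
cube = CellStore()
cube.put("p42", "props:mass", 1, 1.5)
cube.put("p42", "props:mass", 2, 1.7)

print(flat.row_count(), cube.row_count())
print(cube.versions("p42", "props:mass"))
```

In the 3-D layout the history of one property sits in a single cell address, so reading it across timestamps touches one row rather than one row per timestamp.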
8. Case Study: The Datasets
» Cosmology Dataset
˃ Product of an N-body simulation
˃ Three types of particles: dark matter, gas and star
˃ Particles evolve over a series of discrete timestamps
˃ Each snapshot records the properties of all particles at the time of the snapshot
˃ 9 snapshots, comprising 321,065,547 particles
» Bixi Dataset
˃ Data from a bicycle-rental service in the city of Montreal
˃ Every minute, sensors collect statistics about bike usage at each station
˃ 100,800 timestamps, covering 404 stations
9. Three Schemas for the Cosmology Dataset
Schema     Row                 Family:Column          Version
Schema1    sid-type-pid        particle properties    no meaning
Schema2    type-pid            particle properties    snapshot id
Schema3    type-reversedpid    particle properties    snapshot id
[Figure: table axes X (row), Y (family:column), Z (version)]
          Schema1          Schema2       Schema3
Region    24-2-33446666    2-33446666    2-00005533
Region    64-2-33559999    2-33550000    2-66664433
Region    84-2-33550000    2-33559999    2-99995533
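The three row-key layouts can be sketched as small key-building functions. This is an illustration under assumptions: the 8-digit zero padding is inferred from the example keys on the slide, and the helper names are hypothetical. The slides do not state why Schema3 reverses the particle ID; one plausible reading is that reversal spreads consecutive IDs across the sorted key space.

```python
def _padded(pid, width=8):
    # Assumption: particle IDs are zero-padded to 8 digits, as in the slide's keys.
    return f"{pid:0{width}d}"

def schema1_key(sid, ptype, pid):
    """Schema1 row key: snapshot id, particle type, particle id."""
    return f"{sid}-{ptype}-{_padded(pid)}"

def schema2_key(ptype, pid):
    """Schema2 row key: particle type, particle id (snapshot id goes in the version)."""
    return f"{ptype}-{_padded(pid)}"

def schema3_key(ptype, pid):
    """Schema3 row key: particle type, digit-reversed particle id."""
    return f"{ptype}-{_padded(pid)[::-1]}"

# Four consecutive particle IDs of type 2, sorted the way a
# key-ordered store would arrange them.
pids = [33550000, 33550001, 33550002, 33550003]
s2 = sorted(schema2_key(2, p) for p in pids)
s3 = sorted(schema3_key(2, p) for p in pids)
print(s2)  # keys share a long common prefix, so they sit next to each other
print(s3)  # keys differ in the first digit after the type, so they sit far apart
```

Under Schema2, a continuous range of particle IDs maps to a contiguous key range; under Schema3 the same IDs are scattered, which trades range locality for a more even spread.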
11. Three Schemas for the Bixi Dataset
Schema     Row         Family:Column         Version
Schema1    hour-sid    minutes [0,59]        no meaning
Schema2    hour-sid    monitoring metrics    minutes [0,59]
Schema3    day-sid     monitoring metrics    minutes [0,1439]
[Figure: panels Schema1, Schema2, Schema3, drawn on the axes X, metrics, and Time]
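To see how a single sensor reading lands in each of the three Bixi schemas, the mapping can be sketched as follows. The key formats (`YYYYMMDDHH-sid`, `YYYYMMDD-sid`), the metric name `bikes`, and the function name are assumptions made for illustration; the slides give only the logical layout.

```python
from datetime import datetime

def bixi_coordinates(ts, sid, metric="bikes"):
    """Return {schema: (row, column, version)} for one reading.

    Assumed formats: hour rows are YYYYMMDDHH-sid, day rows are YYYYMMDD-sid.
    """
    hour_row = f"{ts:%Y%m%d%H}-{sid}"
    day_row = f"{ts:%Y%m%d}-{sid}"
    return {
        # Schema1: the minute is the column; the version carries no meaning.
        "Schema1": (hour_row, f"{ts.minute:02d}", None),
        # Schema2: the metric is the column; the version is the minute of the hour.
        "Schema2": (hour_row, metric, ts.minute),
        # Schema3: the metric is the column; the version is the minute of the day.
        "Schema3": (day_row, metric, ts.hour * 60 + ts.minute),
    }

coords = bixi_coordinates(datetime(2010, 9, 24, 13, 5), sid=17)
for name, triple in coords.items():
    print(name, triple)
```

Schema3 packs a whole day of readings for one station into one row, using up to 1,440 versions per cell, whereas Schema1 and Schema2 use one row per station per hour.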
12. The Bixi dataset
» A period of 70 days, from Sep 24, 2010 to Dec 1, 2010
» 100,800 timestamps
» 404 stations involved
» Stored as XML
13. Experiment Results
» Experiment Environment
˃ A four-node cluster on virtual machines with Ubuntu
˃ Hadoop 0.20, HBase 0.93-snapshot (Coprocessor support)
˃ HBase Configuration
+ Replication factor of 2
+ 5 KB caching size
» Queries for each dataset
˃ Three queries on the Cosmology dataset, taken from related research
˃ One query on the Bixi dataset, derived from a business requirement
» Query-processing implementation
˃ Native Java API
˃ User-level Coprocessor implementation
14. Query1 of Cosmology Dataset
» Get all the particles of a type: star
» in a single snapshot
» with a given property: tform
» whose property matches the expression
˃ [>0.01;84]
˃ [>0.08;128]
˃ [>0.05;128]
˃ [>0.08;216]
˃ [>0.08;512]
15. Query2 of Cosmology Dataset
» Get all the particles added/destroyed
» between s1 and s2
˃ [29;24]
˃ [60;24]
˃ [84;24]
˃ [128;24]
˃ [216;128]
˃ [216;24]
˃ [512;24]
˃ [512;128]
˃ [512;216]
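When the snapshot id is stored as the version, Query2 reduces to comparing the set of versions present in each particle row. The sketch below assumes that reading and compares only the two endpoint snapshots, with `s1` taken as the earlier one; the function name and the sample rows are hypothetical, and this is not the authors' implementation.

```python
def added_and_destroyed(versions_by_row, s1, s2):
    """Compare two snapshots given, per row key, the set of snapshot ids stored.

    A particle counts as added if it has a version for s2 but not s1,
    and as destroyed if it has a version for s1 but not s2.
    """
    added = sorted(r for r, v in versions_by_row.items() if s2 in v and s1 not in v)
    destroyed = sorted(r for r, v in versions_by_row.items() if s1 in v and s2 not in v)
    return added, destroyed

# Hypothetical rows: key -> snapshot ids in which the particle exists.
rows = {
    "2-00000001": {24, 29, 60},  # present throughout
    "2-00000002": {24},          # gone by snapshot 60
    "2-00000003": {60},          # appears by snapshot 60
}
added, destroyed = added_and_destroyed(rows, 24, 60)
print(added, destroyed)
```

With the snapshot id in the row key instead (Schema1), the same comparison needs two separate key ranges to be read and joined on particle ID.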
16. Query3 of Cosmology Dataset
» Get the values of a property
» for a set of particle IDs
» across the selected snapshots
˃ 10;[24]
˃ 10;[24,512]
˃ 10;[24,60,128,512]
˃ 10;[24,29,60,84,128,512]
˃ 10;[24,36,45,60,84,128,216,512]
˃ 50;[24,29,84,512]
˃ 50;[24,29,36,45,60,84,128,216,512]
˃ 100;[24,29,36,45,60,84,128,216,512]
˃ 150;[24,29,36,45,60,84,128,216,512]
17. Query3 of Cosmology Dataset
» Get the values of a property: star:eps
» for a set of particle IDs: a continuous range particle IDs
» across the selected snapshots
˃ 10;[24]
˃ 10;[24,512]
˃ 10;[24,60,128,512]
˃ 10;[24,29,60,84,128,512]
˃ 10;[24,36,45,60,84,128,216,512]
˃ 50;[24,29,84,512]
˃ 50;[24,29,36,45,60,84,128,216,512]
˃ 100;[24,29,36,45,60,84,128,216,512]
˃ 150;[24,29,36,45,60,84,128,216,512]
18. Bixi Query
» For a given list of stations: 200 stations
» get average bike usage in a given period
˃ [1day]
˃ [2day]
˃ [4day]
˃ [8day]
˃ [16day]
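The number of rows the Bixi query has to read depends on the row-key granularity. The sketch below enumerates the row keys for 200 stations over each period, once with hourly rows (Schema1 and Schema2) and once with daily rows (Schema3). The key formats, station ids, and start date are assumptions for illustration, and row count is only one factor in query latency; the measured results are not reproduced here.

```python
from datetime import date, timedelta

def rows_to_read(stations, start, days, per_hour):
    """List the row keys a query must touch for `stations` over `days` days.

    Assumed formats: YYYYMMDDHH-sid for hourly rows, YYYYMMDD-sid for daily rows.
    """
    keys = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        for sid in stations:
            if per_hour:
                keys.extend(f"{day:%Y%m%d}{hour:02d}-{sid}" for hour in range(24))
            else:
                keys.append(f"{day:%Y%m%d}-{sid}")
    return keys

stations = range(1, 201)      # 200 hypothetical station ids
start = date(2010, 9, 24)
for days in (1, 2, 4, 8, 16):
    hourly = len(rows_to_read(stations, start, days, per_hour=True))
    daily = len(rows_to_read(stations, start, days, per_hour=False))
    print(days, hourly, daily)
```

For every period, the daily layout touches 24 times fewer rows than the hourly one, at the cost of more versions per cell.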
19. Discussion
» “Qualitative” versus “Quantitative” Suggestions
» Dynamic Data versus Static Data
» Historical Dataset versus Real-Time Datasets
» Supported versus Non-Supported Datasets
20. Conclusion
» The objective is to make queries local
» To do that, you have to design the right key, so that each query traverses a range of keys
˃ With all the answers in it
˃ With little irrelevant data in it
» But, hotspotting occurs when
˃???
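The effect of key design on scan locality can be illustrated with a sorted list standing in for a key-ordered table. This is a toy model with hypothetical key layouts, not a measurement: it counts how many keys a range scan traverses versus how many are relevant to a query for type-2 particles with IDs 100 to 109.

```python
from bisect import bisect_left

def scan(sorted_keys, start, stop):
    """Return the keys in [start, stop), as a key-ordered store would."""
    return sorted_keys[bisect_left(sorted_keys, start):bisect_left(sorted_keys, stop)]

types = (1, 2, 3)
pids = range(1000)

# Layout A (hypothetical): type first, then particle id.
layout_a = sorted(f"{t}-{pid:08d}" for t in types for pid in pids)
# Layout B (hypothetical): particle id first, then type.
layout_b = sorted(f"{pid:08d}-{t}" for t in types for pid in pids)

# Query: type-2 particles with ids 100..109.
hits_a = scan(layout_a, "2-00000100", "2-00000110")
hits_b = scan(layout_b, "00000100", "00000110")
relevant_b = [k for k in hits_b if k.endswith("-2")]

print(len(hits_a))                     # keys traversed under layout A
print(len(hits_b), len(relevant_b))    # keys traversed vs. relevant under layout B
```

Both scans contain all the answers, but layout B also traverses the rows of the other two particle types, which is the irrelevant data the key design is meant to avoid.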
21. Conclusion
» A 3-dimensional data model
˃ Improved performance can be obtained from data schemas that use the version dimension of HBase
» Fits “write-once, read-many” systems
˃ Monitoring systems
˃ Sensor-based systems
˃ Version-based analysis
22. Future Work
» More Evaluation of this data model
˃Scalability
˃Elasticity
˃Utilization
» How to design data models for other datasets
˃Spatial dataset
˃Graphic dataset