Dan Han, Eleni Stroulia
            University of Alberta
9/20/2012




                MESOCA 2012           1
Outline
            »   Background and Motivation
            »   Related Work
            »   A 3-Dimensional Data Model in HBase
            »   Case Study and Experiment Results
            »   Discussion
            »   Conclusions and Future Work
Migrating Applications to the Cloud
            » Cloud is an attractive computing platform
               ˃ Elasticity, Excellent Scalability, High Availability, Low Operating
                 Cost

            » Applications are moving to the cloud
               ˃ Social networking, online shopping, monitoring systems
               ˃ Time-series data: grows monotonically over time
               ˃ Analysis of large scale time-series data
                    + May lead to new knowledge
                    + May lead to improvements of existing services


            » Successful adoption of this paradigm requires a new model of storage
Migrating RDBMS Content to NoSQL
            » From RDBMS to NoSQL storage systems
               ˃ Enable the storage of big data, ordered by row key
               ˃ Scale horizontally across storage nodes easily
               ˃ Not much data-organization support


            » Migration challenges
               ˃ Little experience and few principles to follow
               ˃ Steep learning curve for programming
               ˃ Much experimentation is required before deployment
                    + Much time is spent in designing the data schema
                    + The “wrong” schema may lead to inefficient, high-latency queries
We Need Design Patterns for HBase Schemas
            » Our objective is to develop a systematic method for
               ˃ Guiding data organization in NoSQL databases, given
                    + The types of data stored
                    + The amount of data
                    + The data-usage patterns


            » We start our investigation with HBase
               ˃ A NoSQL database offering, built on top of Hadoop
               ˃ Parallel Distributed Computation
                   + MapReduce Framework
                   + Coprocessor Framework
Related Work
            » Talks at HBaseCon 2012, held in May
               ˃ Data schemas and Coprocessors were the two main topics
               ˃ Experience from 30 enterprises, e.g., Facebook, Yapmap, eBay, Adobe


            » Organizing time-series data in period-specific “buckets”
               ˃ OpenTSDB: a distributed scalable time-series database, on top of
                 HBase
               ˃ A data model in Cassandra, another NoSQL database offering
               ˃ Applied in our case study
Data Organization in HBase
             » Cell in HBase
                  ˃ (Row, Family:Column, Version) => (X, Y, Z) = value
                  ˃ [Figure: 3-D model with axes X, Y, Z versus 2-D model with axes X, Y]

            Schema/dimension   Row                     Family:Column        Version
            2-D                unique id - timestamp   varying properties   current timestamp
            3-D                unique id               varying properties   timestamps
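The difference between the 2-D and 3-D layouts can be sketched with plain Python dictionaries. This is an illustrative in-memory model only, not the HBase API; the id `s1`, the column `info:bikes`, and the readings are made-up assumptions.

```python
from collections import defaultdict

def make_table():
    # table[row][column][version] = value, mirroring (Row, Family:Column, Version)
    return defaultdict(lambda: defaultdict(dict))

def put(table, row, column, version, value):
    table[row][column][version] = value

# (timestamp, value) pairs; hypothetical data for illustration
readings = [(1000, 5), (1060, 7), (1120, 6)]

# 2-D layout: the timestamp is part of the row key, the version carries no meaning.
flat = make_table()
for ts, value in readings:
    put(flat, f"s1-{ts}", "info:bikes", 0, value)

# 3-D layout: one row per id, the timestamp is stored in the version dimension.
cube = make_table()
for ts, value in readings:
    put(cube, "s1", "info:bikes", ts, value)

print(len(flat), len(cube))  # 3 1
```

In the 3-D layout all readings of one id live in a single row, so a time-range question becomes a lookup over versions rather than a scan over rows.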
Case Study: The Datasets
            » Cosmology Dataset
               ˃ Product of an N-body simulation
               ˃ Three types of particles: dark matter, gas and star
               ˃ Particles evolve over a series of discrete timestamps
               ˃ Each snapshot records the properties of all particles at
                 the time of the snapshot
               ˃ 9 snapshots, with 321,065,547 particles in total
            » Bixi Dataset
               ˃ Data from a bicycle-renting service in the city of
                 Montreal
               ˃ Every minute, sensors collect statistics about bike usage
                 at each station
               ˃ 100,800 timestamps across 404 stations
Three Schemas for the Cosmology Dataset
            Schema/dimension   Row                Family:Column         Version
            Schema1            sid-type-pid       particle properties   no meaning
            Schema2            type-pid           particle properties   snapshot id
            Schema3            type-reversedpid   particle properties   snapshot id

            [Figure: 3-D axes X, Y, Z]

                             Schema1          Schema2       Schema3
               Region        24-2-33446666    2-33446666    2-00005533
               Region        64-2-33559999    2-33550000    2-66664433
               Region        84-2-33550000    2-33559999    2-99995533
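The three row-key formats can be sketched as small helper functions. The function names are hypothetical, and treating the type as an integer code (as in the `2-...` region keys) is an assumption; only the key shapes come from the table.

```python
def schema1_key(sid: int, ptype: int, pid: int) -> str:
    # Schema1: snapshot id, particle type and particle id all sit in the row key.
    return f"{sid}-{ptype}-{pid}"

def schema2_key(ptype: int, pid: int) -> str:
    # Schema2: the snapshot id moves out of the key into the version dimension.
    return f"{ptype}-{pid}"

def schema3_key(ptype: int, pid: int) -> str:
    # Schema3: like Schema2, but the digits of the particle id are reversed.
    return f"{ptype}-{str(pid)[::-1]}"

print(schema1_key(84, 2, 33550000))  # 84-2-33550000
print(schema2_key(2, 33550000))      # 2-33550000
print(schema3_key(2, 33550000))      # 2-00005533

# Consecutive ids end up far apart in sorted order once reversed.
print(sorted(schema3_key(2, pid) for pid in (33550000, 33550001, 33550002)))
```

Reversing the id changes which keys are neighbours in the sorted key space, which is presumably why Schema3 shows different region boundaries than Schema2.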
The cosmology dataset
            » Dataset called “cosmo50”
              ˃ 9 snapshots
                               S-ID   Star particles   Total particles
                               24              1,291       33,555,723
                               29              5,568       33,559,998
                               36             20,246       33,574,630
                               45             67,268       33,620,890
                               60            259,219       33,800,108
                               84            907,025       34,369,014
                               128         2,743,966       35,908,164
                               216         6,396,955       38,889,220
                               512        12,417,544       43,787,800
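As a quick arithmetic check, the per-snapshot totals in the table can be summed to confirm that they add up to the 321,065,547 particles quoted for the whole dataset. The variable names are illustrative.

```python
# Total particles per snapshot id, copied from the cosmo50 table.
total_particles = {
    24: 33_555_723,
    29: 33_559_998,
    36: 33_574_630,
    45: 33_620_890,
    60: 33_800_108,
    84: 34_369_014,
    128: 35_908_164,
    216: 38_889_220,
    512: 43_787_800,
}

grand_total = sum(total_particles.values())
print(len(total_particles))  # 9
print(grand_total)           # 321065547
```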
Three Schemas for the Bixi Dataset
            Schema/dimension   Row        Family:Column        Version
            Schema1            hour-sid   minutes [0,59]       no meaning
            Schema2            hour-sid   monitoring metrics   minutes [0,59]
            Schema3            day-sid    monitoring metrics   minutes [0,1439]

            [Figure: axes per schema. Schema1: X, Time. Schema2: X, metrics, Time. Schema3: X, metrics, Time.]
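A minimal sketch of how one reading would be addressed under each Bixi schema follows. The key formatting (`YYYYMMDDHH` for the hour, `YYYYMMDD` for the day), the function names, and the metric name `nbBikes` are assumptions made for illustration; the slides only give the shape `hour-sid` / `day-sid` and the version ranges.

```python
from datetime import datetime

def bixi_schema1(ts: datetime, sid: int):
    # Row = hour-sid, column = minute of the hour, version unused.
    return (f"{ts:%Y%m%d%H}-{sid}", str(ts.minute), None)

def bixi_schema2(ts: datetime, sid: int, metric: str):
    # Row = hour-sid, column = metric, version = minute of the hour [0,59].
    return (f"{ts:%Y%m%d%H}-{sid}", metric, ts.minute)

def bixi_schema3(ts: datetime, sid: int, metric: str):
    # Row = day-sid, column = metric, version = minute of the day [0,1439].
    return (f"{ts:%Y%m%d}-{sid}", metric, ts.hour * 60 + ts.minute)

ts = datetime(2010, 9, 24, 13, 7)
print(bixi_schema1(ts, 42))             # ('2010092413-42', '7', None)
print(bixi_schema2(ts, 42, "nbBikes"))  # ('2010092413-42', 'nbBikes', 7)
print(bixi_schema3(ts, 42, "nbBikes"))  # ('20100924-42', 'nbBikes', 787)
```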
The Bixi dataset
            »   A period of 70 days, from Sep 24, 2010 to Dec 1, 2010
            »   100,800 timestamps
            »   404 stations involved
            »   Stored in XML format
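The figures above are consistent with one timestamp per minute, and they also indicate roughly how many rows each schema would produce. The sketch below assumes every station reports in every minute, which the slides do not state.

```python
DAYS = 70
STATIONS = 404
MINUTES_PER_DAY = 24 * 60

timestamps = DAYS * MINUTES_PER_DAY  # one timestamp per minute
readings = timestamps * STATIONS     # one reading per station per minute (assumed)

rows_hour_sid = DAYS * 24 * STATIONS  # Schema1 and Schema2: one row per hour and station
rows_day_sid = DAYS * STATIONS        # Schema3: one row per day and station

print(timestamps)     # 100800
print(readings)       # 40723200
print(rows_hour_sid)  # 678720
print(rows_day_sid)   # 28280
```

Under this assumption, moving from `hour-sid` to `day-sid` rows cuts the row count by a factor of 24, with the remaining detail carried by the version dimension.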
Experiment Results
            » Experiment Environment
               ˃ A four-node cluster on virtual machines with Ubuntu
               ˃ Hadoop 0.20, HBase 0.93-snapshot (Coprocessor support)
               ˃ HBase Configuration
                    + Replication factor of 2
                    + 5 KB caching size


            » Queries for each dataset
               ˃ Three queries on the Cosmology dataset, taken from related research
               ˃ One query on the Bixi dataset, taken from business requirements


            » Query processing implementation
               ˃ Native Java API
               ˃ User-level Coprocessor implementation
Query1 of Cosmology Dataset
            »   Get all the particles of a type: star
            »   in a single snapshot
            »   with a given property: tform
            »   whose property matches the expression [condition;snapshot id]
                ˃ [>0.01;84]
                ˃ [>0.08;128]
                ˃ [>0.05;128]
                ˃ [>0.08;216]
                ˃ [>0.08;512]
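Query1 can be sketched over a toy in-memory table in the Schema2 layout (row `type-pid`, version = snapshot id). This is not the HBase client API; the table contents, the string type names, and the function name `query1` are made up for illustration.

```python
# table[row][column][snapshot_id] = value; all ids and values are hypothetical.
table = {
    "gas-7":  {"mass":  {84: 1.0, 128: 1.0}},
    "star-1": {"tform": {84: 0.02, 128: 0.02}},
    "star-2": {"tform": {128: 0.09}},
    "star-3": {"tform": {84: 0.005}},
}

def query1(table, ptype, prop, snapshot, threshold):
    """Return (row, value) for particles of `ptype` whose `prop`
    exceeds `threshold` in the given snapshot."""
    prefix = f"{ptype}-"
    hits = []
    for row in sorted(table):  # visit row keys in sorted order
        if not row.startswith(prefix):
            continue
        value = table[row].get(prop, {}).get(snapshot)
        if value is not None and value > threshold:
            hits.append((row, value))
    return hits

print(query1(table, "star", "tform", 84, 0.01))   # [('star-1', 0.02)]
print(query1(table, "star", "tform", 128, 0.08))  # [('star-2', 0.09)]
```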
Query2 of Cosmology Dataset
            » Get all the particles added/destroyed
            » between s1 and s2
               ˃ [29;24]
               ˃ [60;24]
               ˃ [84;24]
               ˃ [128;24]
               ˃ [216;128]
               ˃ [216;24]
               ˃ [512;24]
               ˃ [512;128]
               ˃ [512;216]
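A minimal sketch of Query2 as two set differences, again over a toy in-memory table in the Schema2 layout. It assumes a particle "exists" in a snapshot when any of its columns has a version for that snapshot id; the data and the names `particles_in` and `query2` are hypothetical.

```python
# table[row][column][snapshot_id] = value; all ids and values are made up.
table = {
    "star-1": {"mass": {24: 1.0, 29: 1.0}},
    "star-2": {"mass": {29: 0.5}},  # first appears in snapshot 29
    "gas-9":  {"mass": {24: 2.0}},  # no longer present in snapshot 29
}

def particles_in(table, snapshot):
    """Rows that have at least one cell versioned with `snapshot`."""
    return {
        row
        for row, columns in table.items()
        if any(snapshot in versions for versions in columns.values())
    }

def query2(table, earlier, later):
    """Return (added, destroyed) between two snapshots."""
    before = particles_in(table, earlier)
    after = particles_in(table, later)
    return after - before, before - after

added, destroyed = query2(table, 24, 29)
print(added)      # {'star-2'}
print(destroyed)  # {'gas-9'}
```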
Query3 of Cosmology Dataset
            » Get the values of a property: star:eps
            » for a set of particle IDs: a contiguous range of particle IDs
            » across the selected snapshots
               ˃ 10;[24]
               ˃ 10;[24,512]
               ˃ 10;[24,60,128,512]
               ˃ 10;[24,29,60,84,128,512]
               ˃ 10;[24,36,45,60,84,128,216,512]
               ˃ 50;[24,29,84,512]
               ˃ 50;[24,29,36,45,60,84,128,216,512]
               ˃ 100;[24,29,36,45,60,84,128,216,512]
               ˃ 150;[24,29,36,45,60,84,128,216,512]
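Query3 can be sketched as a lookup over a contiguous id range that keeps only the requested versions. The sketch uses a toy in-memory table in the Schema2 layout; the ids, values, string type name, and the function name `query3` are assumptions for illustration.

```python
# table[row][column][snapshot_id] = value; all ids and values are made up.
table = {
    "star-10": {"star:eps": {24: 0.1, 512: 0.3}},
    "star-11": {"star:eps": {24: 0.2}},
    "star-12": {"star:eps": {60: 0.4}},
}

def query3(table, ptype, first_pid, count, prop, snapshots):
    """Values of `prop` for `count` consecutive particle ids,
    restricted to the selected snapshot ids."""
    result = {}
    for pid in range(first_pid, first_pid + count):
        versions = table.get(f"{ptype}-{pid}", {}).get(prop, {})
        result[pid] = {s: versions[s] for s in snapshots if s in versions}
    return result

print(query3(table, "star", 10, 2, "star:eps", [24, 512]))
# {10: {24: 0.1, 512: 0.3}, 11: {24: 0.2}}
```

With the snapshot id in the version dimension, adding more snapshots to the query touches the same rows rather than additional ones.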
Bixi Query
            » For a given list of stations: 200 stations
            » get average bike usage in a given period
               ˃ [1day]
               ˃ [2day]
               ˃ [4day]
               ˃ [8day]
               ˃ [16day]
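The Bixi query can be sketched over a toy table in the Schema3 layout (row `day-sid`, version = minute of the day). The metric name `usage`, the day format, the data, and the function name `average_usage` are hypothetical.

```python
# table[row][column][minute_of_day] = value; all values are made up.
table = {
    "20100924-1": {"usage": {0: 2, 1: 4}},
    "20100924-2": {"usage": {0: 6}},
    "20100925-1": {"usage": {0: 8}},
}

def average_usage(table, station_ids, days, metric):
    """Average of `metric` over the given stations and days,
    or None when no reading matches."""
    total, count = 0.0, 0
    for day in days:
        for sid in station_ids:
            versions = table.get(f"{day}-{sid}", {}).get(metric, {})
            total += sum(versions.values())
            count += len(versions)
    return total / count if count else None

print(average_usage(table, [1, 2], ["20100924"], "usage"))              # 4.0
print(average_usage(table, [1, 2], ["20100924", "20100925"], "usage"))  # 5.0
```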
Discussion
            »   “Qualitative” versus “Quantitative” Suggestions
            »   Dynamic Data versus Static Data
            »   Historical Datasets versus Real-Time Datasets
            »   Supported versus Non-Supported Datasets
Conclusion
            » The objective is to make queries local
            » To do that, you have to design the right key, so that each
              query traverses a range of keys
               ˃ That contains all the answers
               ˃ That contains little irrelevant data
            » But, hotspotting occurs when
               ˃???
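The idea of a query traversing one range of keys can be illustrated with a sorted key list and binary search. This is a sketch of the principle only, not of HBase's scan implementation; the zero-padded key format and the function name `range_scan` are assumptions.

```python
from bisect import bisect_left, bisect_right

def range_scan(sorted_keys, start, stop):
    """Return the contiguous slice of keys with start <= key <= stop."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_right(sorted_keys, stop)
    return sorted_keys[lo:hi]

# Zero-padded ids keep lexicographic order aligned with numeric order.
keys = sorted(f"2-{pid:08d}" for pid in range(100))

print(range_scan(keys, "2-00000010", "2-00000012"))
# ['2-00000010', '2-00000011', '2-00000012']
```

When the key design matches the query, the answer is one contiguous slice and nothing outside it has to be read.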
Conclusion
            » A 3-dimensional data model
               ˃ Data schemas that use the version dimension of HBase
                 can deliver improved performance
            » Fits “write-once, read-many” systems
               ˃ Monitoring systems
               ˃ Sensor-based systems
               ˃ Version-based analysis
Future Work
            » Further evaluation of this data model
               ˃ Scalability
               ˃ Elasticity
               ˃ Utilization
            » How to design data models for other datasets
               ˃ Spatial datasets
               ˃ Graphic datasets
Questions?

                          Thank you

Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Dernier (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

A 3-Dimensional Data Model in HBase for Large Time-Series Datasets

  • 1. Dan Han, Eleni Stroulia, University of Alberta. MESOCA 2012, 9/20/2012
  • 2. Outline
    » Background and Motivation
    » Related Work
    » A 3-Dimensional Data Model in HBase
    » Case Study and Experiment Results
    » Discussion
    » Conclusions and Future Work
  • 3. Migrating Applications To the Cloud
    » Cloud is an attractive computing platform
      ˃ Elasticity, Excellent Scalability, High Availability, Low Operating Cost
    » Applications are moving to the cloud
      ˃ Social networking, online shopping, monitoring systems
      ˃ Time-series data: grows monotonically over time
      ˃ Analysis of large-scale time-series data
        + May lead to new knowledge
        + May lead to improvements of existing services
    » Successful adoption of this migration requires a new storage model
  • 4. Migrating RDBMS Content To NoSQL
    » From RDBMS to NoSQL storage systems
      ˃ Enable the storage of big data, sorted by row key
      ˃ Scale horizontally across storage nodes easily
      ˃ Not much data-organization support
    » Migration challenges
      ˃ Little experience and few principles to follow
      ˃ Steep learning curve for programming
      ˃ Much experimentation is required before deployment
        + Much time is spent designing the data schema
        + The “wrong” schema may lead to inefficient, high-latency queries
  • 5. We need Design Patterns for HBase Schemas
    » Our objective is to develop a systematic method for
      ˃ Guiding data organization in NoSQL databases, given
        + The types of data stored
        + The amount of data
        + The data-usage patterns
    » We start our investigation with HBase
      ˃ A NoSQL database offering, built on top of Hadoop
      ˃ Parallel Distributed Computation
        + MapReduce Framework
        + Coprocessor Framework
  • 6. Related Work
    » Talks at HBaseCon 2012, held in May
      ˃ Data schema and Coprocessor were the two main topics
      ˃ Experience from 30 enterprises, e.g., Facebook, Yapmap, eBay, Adobe
    » Organizing time-series data in period-specific “buckets”
      ˃ OpenTSDB: a distributed, scalable time-series database on top of HBase
      ˃ A data model in Cassandra, another NoSQL database offering
      ˃ Applied in our case study
  • 7. Data Organization in HBase
    » Cell in HBase
      ˃ (Row, Family:Column, Version) => (X, Y, Z) = value
    [Figure: 2-D model (axes X, Y) versus 3-D model (axes X, Y, Z)]
    Schema/dimension   Row                     Family:Column        Version
    2-D                unique id - timestamp   varying properties   current timestamp
    3-D                unique id               varying properties   timestamps
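The cell addressing on this slide can be illustrated with a small in-memory model. This is a plain-Python sketch, not the HBase client API: the helper names `put` and `get`, the column name `particle:mass`, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy in-memory model of HBase cell addressing (not the HBase client API).
# Layout: table[row]["family:column"][version] = value
table = defaultdict(lambda: defaultdict(dict))

def put(row, column, version, value):
    """Store a value at the (row, family:column, version) coordinate."""
    table[row][column][version] = value

def get(row, column, version=None):
    """Return one version, or the newest one when no version is given."""
    versions = table[row][column]
    if not versions:
        return None
    if version is None:
        version = max(versions)
    return versions.get(version)

# 3-D usage: the version carries meaning (here, a hypothetical snapshot id)
# instead of being the write time.
put("2-33446666", "particle:mass", 24, 0.5)
put("2-33446666", "particle:mass", 84, 0.7)
```

In this toy model, asking for a specific version reads one point on the Z axis, while omitting it falls back to the newest version, which is how a 2-D schema would normally be read.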
  • 8. Case study: The Datasets
    » Cosmology Dataset
      ˃ Product of an N-body simulation
      ˃ Three types of particles: dark matter, gas and star
      ˃ Particles evolve over a series of discrete timestamps
      ˃ Each snapshot records the properties of all particles at the time of the snapshot
      ˃ 9 snapshots, containing 321,065,547 particles in total
    » Bixi Dataset
      ˃ Data from a bicycle-renting service in the city of Montreal
      ˃ Every minute, sensors collect statistics about bike usage at each station
      ˃ 100,800 timestamps, covering 404 stations
  • 9. Three Schemas for the Cosmology Dataset
    Schema/dimension   Row                Family:Column         Version
    Schema1            sid-type-pid       particle properties   No meaning
    Schema2            type-pid           particle properties   Snapshot id
    Schema3            type-reversedpid   particle properties   Snapshot id
    [Figure: axes X, Y, Z]
    Example row keys per region:
             Schema1         Schema2      Schema3
    Region   24-2-33446666   2-33446666   2-00005533
    Region   64-2-33559999   2-33550000   2-66664433
    Region   84-2-33550000   2-33559999   2-99995533
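The three row-key layouts can be sketched as small key-building functions. The function names and the 8-digit zero-padded particle id are assumptions for illustration; the slide gives only the key patterns and the example keys.

```python
def schema1_key(sid, ptype, pid):
    """Schema1: snapshot id, particle type, particle id."""
    return f"{sid}-{ptype}-{pid:08d}"

def schema2_key(ptype, pid):
    """Schema2: particle type, particle id (snapshot id goes in the version)."""
    return f"{ptype}-{pid:08d}"

def schema3_key(ptype, pid):
    """Schema3: particle type, particle id with its digits reversed."""
    reversed_pid = f"{pid:08d}"[::-1]
    return f"{ptype}-{reversed_pid}"

# Particle ids taken from the example keys on the slide.
pids = [33446666, 33550000, 33559999]
print(sorted(schema2_key(2, p) for p in pids))
print(sorted(schema3_key(2, p) for p in pids))
```

One effect visible in the sorted output is that reversing the digits changes the sort order, so numerically close particle ids no longer sit next to each other in the key space.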
  • 10. The cosmology dataset
    » Dataset called “cosmo50”
      ˃ 9 snapshots
    S-ID   Star Particles   Total particles
    24     1,291            33,555,723
    29     5,568            33,559,998
    36     20,246           33,574,630
    45     67,268           33,620,890
    60     259,219          33,800,108
    84     907,025          34,369,014
    128    2,743,966        35,908,164
    216    6,396,955        38,889,220
    512    12,417,544       43,787,800
  • 11. Three Schemas for the Bixi Dataset
    Schema/dimension   Row        Family:Column        Version
    Schema1            hour-sid   minutes [0,59]       no meaning
    Schema2            hour-sid   monitoring metrics   minutes [0,59]
    Schema3            day-sid    monitoring metrics   minutes [0,1439]
    [Figure: cell layout under Schema1, Schema2 and Schema3, with time and metrics axes]
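As a rough illustration of Schema2 and Schema3, the sketch below maps one reading to its row key and version. The function names, the `YYYYMMDDHH` and `YYYYMMDD` key encodings, and the station id are assumptions; the slide specifies only `hour-sid`, `day-sid` and the minute ranges.

```python
from datetime import datetime

def schema2_coords(ts, station_id):
    """Schema2: row = hour-sid, version = minute within the hour [0, 59]."""
    row = f"{ts:%Y%m%d%H}-{station_id}"
    return row, ts.minute

def schema3_coords(ts, station_id):
    """Schema3: row = day-sid, version = minute within the day [0, 1439]."""
    row = f"{ts:%Y%m%d}-{station_id}"
    return row, ts.hour * 60 + ts.minute

ts = datetime(2010, 9, 24, 13, 45)
print(schema2_coords(ts, 17))  # ('2010092413-17', 45)
print(schema3_coords(ts, 17))  # ('20100924-17', 825)
```

Under these assumptions, Schema3 packs up to 1,440 versions into one row per station per day, whereas Schema2 spreads the same readings over 24 rows of up to 60 versions each.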
  • 12. The Bixi dataset
    » A period of 70 days, from Sep 24, 2010 to Dec 1, 2010
    » 100,800 timestamps
    » 404 stations involved
    » Stored in XML format
  • 13. Experiment Results
    » Experiment Environment
      ˃ A four-node cluster on virtual machines running Ubuntu
      ˃ Hadoop 0.20, HBase 0.93-snapshot (Coprocessor support)
      ˃ HBase Configuration
        + Replication factor of 2
        + Caching size of 5 KB
    » Queries for each dataset
      ˃ Three queries on the Cosmology dataset, taken from related research
      ˃ One query on the Bixi dataset, derived from a business requirement
    » Query Processing Implementation
      ˃ Native Java API
      ˃ User-Level Coprocessor Implementation
  • 14. Query1 of Cosmology Dataset
    » Get all the particles of a type: star
    » in a single snapshot
    » with a given property: tform
    » whose property matches the expression
      ˃ [>0.01;84]
      ˃ [>0.08;128]
      ˃ [>0.05;128]
      ˃ [>0.08;216]
      ˃ [>0.08;512]
  • 15. Query2 of Cosmology Dataset
    » Get all the particles added/destroyed
    » between s1 and s2
      ˃ [29;24]
      ˃ [60;24]
      ˃ [84;24]
      ˃ [128;24]
      ˃ [216;128]
      ˃ [216;24]
      ˃ [512;24]
      ˃ [512;128]
      ˃ [512;216]
  • 16. Query3 of Cosmology Dataset
    » Get the values of a property
    » for a set of particle IDs
    » across the selected snapshots
      ˃ 10;[24]
      ˃ 10;[24,512]
      ˃ 10;[24,60,128,512]
      ˃ 10;[24,29,60,84,128,512]
      ˃ 10;[24,36,45,60,84,128,216,512]
      ˃ 50;[24,29,84,512]
      ˃ 50;[24,29,36,45,60,84,128,216,512]
      ˃ 100;[24,29,36,45,60,84,128,216,512]
      ˃ 150;[24,29,36,45,60,84,128,216,512]
  • 17. Query3 of Cosmology Dataset
    » Get the values of a property: star:eps
    » for a set of particle IDs: a contiguous range of particle IDs
    » across the selected snapshots
      ˃ 10;[24]
      ˃ 10;[24,512]
      ˃ 10;[24,60,128,512]
      ˃ 10;[24,29,60,84,128,512]
      ˃ 10;[24,36,45,60,84,128,216,512]
      ˃ 50;[24,29,84,512]
      ˃ 50;[24,29,36,45,60,84,128,216,512]
      ˃ 100;[24,29,36,45,60,84,128,216,512]
      ˃ 150;[24,29,36,45,60,84,128,216,512]
  • 18. Bixi Query
    » For a given list of stations: 200 stations
    » get average bike usage in a given period
      ˃ [1day]
      ˃ [2day]
      ˃ [4day]
      ˃ [8day]
      ˃ [16day]
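The shape of this query can be sketched over a toy in-memory store laid out like Schema3 (one row per station per day, one version per minute). The store structure, the `average_usage` helper, the key encoding and the sample values are assumptions for illustration; the slides do not show the actual query implementation.

```python
def average_usage(store, station_ids, days):
    """Average a metric over the given stations and days.

    store maps a 'day-sid' row key to {minute_of_day: value}.
    Returns None when no reading matches.
    """
    total, count = 0.0, 0
    for day in days:
        for sid in station_ids:
            versions = store.get(f"{day}-{sid}", {})
            total += sum(versions.values())
            count += len(versions)
    return total / count if count else None

# Hypothetical readings: row key -> {minute of day: bikes in use}.
store = {
    "20100924-1": {0: 4, 1: 6},
    "20100924-2": {0: 10},
    "20100925-1": {0: 8},
}
print(average_usage(store, [1], ["20100924", "20100925"]))  # 6.0
```

With this layout, widening the period from one day to sixteen days only adds more row keys per station to the lookup; the per-row work stays the same.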
  • 19. Discussion
    » “Qualitative” versus “Quantitative” Suggestions
    » Dynamic Data versus Static Data
    » Historical Datasets versus Real-Time Datasets
    » Supported versus Non-Supported Datasets
  • 20. Conclusion
    » The objective is to make queries local
    » To do that, you have to design the right key, so that every query traverses a range of keys
      ˃ That contains all the answers
      ˃ That contains little irrelevant data
    » But hotspotting occurs when
      ˃ ???
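Because rows are stored in row-key order, a query becomes "local" when its answers share a key prefix and can be read as one contiguous slice. The sketch below shows that idea on a sorted Python list; `prefix_scan` and the sample keys are hypothetical and stand in for a real table scan.

```python
from bisect import bisect_left

def prefix_scan(sorted_keys, prefix):
    """Return the contiguous slice of keys that start with prefix."""
    # Every key with this prefix sorts at or after the prefix itself.
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]

keys = sorted(["20100924-1", "20100924-2", "20100925-1", "20100926-1"])
print(prefix_scan(keys, "20100924-"))  # ['20100924-1', '20100924-2']
```

A key design that puts the queried dimension first keeps the slice short; a design that puts it last would force the same query to touch every key.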
  • 21. Conclusion
    » A 3-dimensional data model
      ˃ Data schemas that use the version dimension of HBase can deliver improved performance
    » Fits “write-once, read-many” systems
      ˃ Monitoring systems
      ˃ Sensor-based systems
      ˃ Version-based analysis
  • 22. Future Work
    » More evaluation of this data model
      ˃ Scalability
      ˃ Elasticity
      ˃ Utilization
    » How to design data models for other datasets
      ˃ Spatial dataset
      ˃ Graphic dataset
  • 23. Questions? Thank you