Nisheet Mahajan
Data Storage and Management
NATIONAL COLLEGE OF IRELAND
Comparison Between HBase and Cassandra using YCSB
Nisheet Mahajan
X16133099
Data Storage and Management
INTRODUCTION
In today's era of massive user-generated data, a database management system that offers
faster and simpler access to billions of records is required. Big data is described by
three V's (its dimensions): volume, velocity and variety. Volume refers to the amount of
data an organization has stored, velocity to the speed with which data is generated and
analyzed, and variety to the range of data types that can be generated in the real world.
The volume and variety of data cause many problems for organizations in the RDBMS
(Relational Database Management System) world. For this reason, new storage and
management systems have been introduced, including Cassandra, Voldemort, HBase and
others, collectively referred to as NoSQL databases. They can index and store massive
data sets as users request, and they trade off some consistency for other properties
that are more useful to their applications.
The basic concern for any data system is high performance. NoSQL performs better than a
relational database management system (RDBMS) in various cases. Researchers have been
looking for a database that is optimized for their use cases. To evaluate the performance
of cloud databases on a common set of workloads, Yahoo presented a framework named YCSB
(Yahoo! Cloud Serving Benchmark). YCSB is an extensible workload generator: users can
define new workloads to test particular aspects of a system, or execute the predefined
Core workloads. It is commonly used to compare and benchmark multiple systems. The two
NoSQL databases considered in this work are HBase and Cassandra. Both provide durability
by logging all write operations to a file, and both use replication to prevent the loss
of data when cluster nodes fail.
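As a rough illustration of what a workload generator does, the following minimal Python sketch draws operations according to a fixed mix. It is not YCSB code: the names `WORKLOADS` and `generate_operations` are hypothetical. The proportions are assumed from the YCSB core workload definitions (Workload A: 50% reads and 50% updates; Workload D: 95% reads and 5% inserts), and YCSB's choice of which key each operation touches is left out.

```python
import random
from collections import Counter

# Operation mixes assumed from the YCSB core workload definitions.
WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50},
    "D": {"read": 0.95, "insert": 0.05},
}

def generate_operations(workload, count, seed=0):
    """Return a list of `count` operation names drawn from the workload mix."""
    mix = WORKLOADS[workload]
    rng = random.Random(seed)  # seeded so that runs are repeatable
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=count)

if __name__ == "__main__":
    for name in WORKLOADS:
        print(name, Counter(generate_operations(name, 50_000)))
```

A real benchmark would send each generated operation to the database and record its latency; this sketch only shows how the operation mix is produced.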
Characteristics of HBase:
HBase is a sparse, distributed, consistent, sorted map. It indexes data by row key, column
family and timestamp. Data is stored in tables, which organize it into rows, and each row
has a unique row key for identification. The following are the basic features of HBase:
✓ HBase is linearly scalable, with automatic failover support.
✓ It provides consistent reads and writes, with an easy Java API for clients.
✓ It replicates data across clusters.
Column Families :
These are defined when the table schema is created, as they are not easy to modify
afterwards.
Column Qualifier :
A column family can contain a very large number of column qualifiers, each of which is
treated as a byte array.
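The data model described above can be pictured as nested maps: row key, then column family, then column qualifier, then timestamp, then value. The sketch below is a plain-Python illustration with hypothetical names (`TinyTable`, `put`, `get`, `scan`); it is not the HBase client API, and it ignores storage, regions and distribution.

```python
from collections import defaultdict

class TinyTable:
    """Toy model of a table: row key -> family -> qualifier -> {timestamp: value}."""

    def __init__(self, families):
        # Column families are fixed at creation, mirroring the schema rule above.
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def put(self, row, family, qualifier, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        # Only cells that are written take up space, so the map is sparse.
        self.rows[row][family][qualifier][timestamp] = value

    def get(self, row, family, qualifier):
        """Return the value with the newest timestamp, or None if absent."""
        versions = self.rows.get(row, {}).get(family, {}).get(qualifier, {})
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self):
        """Return all row keys in sorted order, as a sorted map would."""
        return sorted(self.rows)
```

Writing the same cell twice with different timestamps keeps both versions, and `get` returns the newer one, which is the role the timestamp plays in the index.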
Characteristics of Cassandra :
Cassandra is an open source NoSQL database offering operational simplicity, linear scale
performance and easy data distribution with continuous availability. Cassandra can run
across different machines with no single point of failure. It has a peer-to-peer
architecture, which avoids the problems of a master-slave design.
✓ Linear scale performance – Adding nodes produces a corresponding increase in
performance.
✓ Continuous availability – Redundancy of both data and node function gives constant
uptime.
✓ Transparent fault detection and recovery – A failed node can be restored or
replaced.
✓ Flexible and dynamic data model – The data types support fast writes and reads.
✓ Strong data protection – A commit log design ensures no data loss, and built-in
security with backup/restore keeps data protected and safe.
✓ Tunable data consistency
✓ Multi-data center replication
✓ Column families:
A column family holds the columns that an application defines. Column families are
either static (the column names are fixed in the schema) or dynamic (the column
names are supplied by the application). The types of column are: standard (one
primary key), composite (multiple primary keys), expiring (deleted after some
time) and counter (keeps track of the occurrence of events).
✓ CQL:
CQL is the primary and default interface to Cassandra, and it simplifies data modelling.
Cassandra is designed to handle workloads across many nodes with no single point of failure.
The nodes in a cluster are independent and interconnected. Each node can accept read and
write requests, irrespective of where the data is located in the cluster. Cassandra has
become a reliable choice for business and technical stakeholders.
The validator (the data type of a column value) and the comparator (the data type of a
column name) are two data types in Cassandra that are defined when a column family is created.
Database Architecture
HBase :
In the HBase architecture, tables are divided into regions, which are served by region
servers. Each region is divided by column family into Stores, which are saved as files
in HDFS.
The client library, the master server and the region servers are the major components of
HBase.
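To illustrate how a table divided into regions can be served by different region servers, here is a small Python sketch. It assumes that each region covers a contiguous, sorted range of row keys identified by its start key, which is how HBase regions are usually described but is not spelled out in this report. The `RegionMap` class and the server names are hypothetical.

```python
import bisect

class RegionMap:
    """Maps a row key to the region server holding its key range."""

    def __init__(self, regions):
        # regions: list of (start_key, server) pairs; the first start key
        # must be b"" so that every possible row key falls in some region.
        ordered = sorted(regions)
        if not ordered or ordered[0][0] != b"":
            raise ValueError("first region must start at the empty key")
        self.start_keys = [start for start, _ in ordered]
        self.servers = [server for _, server in ordered]

    def locate(self, row_key):
        # Find the last region whose start key is <= row_key.
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.servers[index]
```

With regions starting at `b""`, `b"m"` and `b"t"`, a key such as `b"apple"` resolves to the first server and `b"zebra"` to the last; splitting a region amounts to adding a new start key.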
HBase Master :
It is the master server, responsible for monitoring all RegionServers. The HBase Master
performs sharding (load balancing) and assigns regions to RegionServers. A cluster can
run multiple HBase Masters, but only one is active at a time. Metadata changes go
through the Master.
RegionServer:
A machine running a RegionServer is considered a worker node, and the RegionServer is
the implementation of the worker role. It is responsible for splitting and compacting
the regions it serves, and it runs on a DataNode. Multiple RegionServers run in a
cluster.
ZooKeeper:
ZooKeeper lets distributed processes coordinate with each other through a shared
hierarchical namespace. In HBase, ZooKeeper is responsible for:
✓ Providing the availability status of RegionServers
✓ Ensuring a single active HMaster in the cluster
✓ Providing the location of the "-ROOT-" table
✓ Selecting a new HMaster if the active HMaster fails
HBase gives the client the flexibility to connect to any node in the cluster. HBase
relies on ZooKeeper to coordinate between clients and the master node.
Cassandra:
Cassandra has a node-based, fault-tolerant, scalable and consistent architecture. The node
is the lowest level in a Cassandra cluster, and each node represents a single instance.
Data centers, racks and nodes together make up the Cassandra architecture. It is a
shared-nothing environment with no central controller.
Data Partitioning:
Cassandra is a distributed database that partitions its data, dividing it evenly across
the nodes of the cluster.
Data Replication:
Multiple nodes in a cluster act as replicas for a given piece of data. If some replicas
respond with an out-of-date value, Cassandra returns the most recent value to the client.
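One common way to picture partitioning and replication together is a hash ring: each node owns a position on the ring, a key is hashed onto the ring, and the next few distinct nodes in ring order hold its replicas. The sketch below is a simplified Python illustration, not Cassandra's implementation. SHA-256 as the hash, one token per node, and the names `HashRing` and `replicas_for` are assumptions made for the example.

```python
import bisect
import hashlib

def _token(value: str) -> int:
    """Hash a string to an integer position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest(), "big")

class HashRing:
    def __init__(self, nodes, replication_factor=3):
        if replication_factor > len(nodes):
            raise ValueError("replication factor exceeds number of nodes")
        self.replication_factor = replication_factor
        # Sort nodes by their token so the ring can be searched with bisect.
        self.ring = sorted((_token(node), node) for node in nodes)
        self.tokens = [token for token, _ in self.ring]

    def replicas_for(self, key):
        """Return the nodes that store `key`, walking the ring from its token."""
        start = bisect.bisect_left(self.tokens, _token(key)) % len(self.ring)
        return [
            self.ring[(start + offset) % len(self.ring)][1]
            for offset in range(self.replication_factor)
        ]
```

Because the placement depends only on the hash, any node can compute which replicas hold a key, which fits the report's point that every node can accept requests.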
Keyspaces:
A keyspace is the container that stores the column families and their data.
Node & data center:
A node is the place where data is stored; a data center is a collection of related nodes.
Cluster:
It contains one or more data centers.
Mem-Table:
It is a memory-resident data structure; a single column family can have multiple mem-tables.
Commit Log :
The crash-recovery mechanism in Cassandra.
SSTable:
It is a disk file to which data is flushed from the mem-table when its contents reach a
threshold value.
Bloom filter:
Bloom filters are fast, probabilistic algorithms for checking whether an element is a
member of a set. They act as a kind of cache and are accessed after every query.
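The membership test that a Bloom filter performs can be sketched in a few lines of Python. This is a generic textbook construction rather than Cassandra's own code; the class name, the default sizes and the use of salted SHA-256 to derive bit positions are choices made for the example.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A "definitely absent" answer is what makes the structure useful on the read path: a lookup can skip a disk file whose filter rules the key out.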
Comparison between HBase and Cassandra in terms of Scalability, Availability and
Reliability:
The analysis is based on the CAP theorem:
•Scalability:
Cassandra and HBase are both scalable databases. Because only the rows and column families
are described in advance, it is easy to add new columns on the fly. Cassandra achieves
linear scalability by adding nodes to the cluster, and the system is designed so that
the cluster makes use of the newly added resources.
•Availability:
Cassandra is rated best in terms of availability. It has a distributed node architecture,
so the data is replicated across the nodes, and if any node fails Cassandra can still
generate a response, which makes it highly available.
•Reliability:
HBase has features such as Hadoop support and range-based row scans, which make it more
reliable than Cassandra. In CAP terms, HBase provides consistency and partition
tolerance, and it is strongly consistent.
Performance Test Plan
Physical Machine
Processor : 2.40 GHz Intel Core i5 (64-bit)
Number of Cores : 2
Memory: 8GB
Operating system : Microsoft Windows 7 Professional
Virtualization Software : VirtualBox 5.1.12
HBase virtual machine -
Operating system : Ubuntu (64 bit)
Memory : 4GB
Processor :1
Cassandra Virtual Machine –
Operating system : Ubuntu (64 bit)
Memory : 4GB
Processor :1
Benchmarking Application –
Yahoo! Cloud Serving Benchmark, 0.11.0
Evaluation and Results:
The tests were run against the HBase and Cassandra databases using two YCSB workloads,
Workload A and Workload D. Each reported value is the average of three test runs.
YCSB Workload A(Read Evaluation)
For the read operations, the average latency for HBase drops slightly between 40,000 and
60,000 read operations and then rises between 60,000 and 80,000, as the graph shows.
[Figure: Read Operations against Read Average Latency, Workload A. X-axis: Read Operations; Y-axis: Read Average Latency; series: HBase, Cassandra.]
Workload A
HBase                                      Cassandra
Read operations    Read average latency    Read operations    Read average latency
25135              491.02                  25033              685.64
50030              437.71                  50142              562.19
75030              812.25                  74938              583.13
100204             688.81                  99511              547.56
149929             515.68                  149704             568.72
YCSB Workload A(Update Evaluation)
[Figure: Update Operations against Update Average Latency, Workload A. X-axis: Update Operations; Y-axis: Update Average Latency; series: HBase, Cassandra.]
Workload A
HBase                                          Cassandra
Update operations    Update average latency    Update operations    Update average latency
24865                787.25                    24967                620.71
49970                741.70                    49858                505.14
74970                1318.70                   75062                519.04
99796                1102.26                   100489               486.13
150071               849.29                    150296               501.23
The average update latency for HBase rises sharply between 60,000 and 80,000 operations,
as seen from the graph above. For Cassandra, the average latency falls between 40,000 and
60,000 operations.
YCSB Workload D (Read Evaluation)
In Workload D, the graph shows that the average read latency for HBase increases, whereas
for Cassandra it decreases.
[Figure: Read Operations against Read Average Latency, Workload D. X-axis: Read Operations; Y-axis: Read Average Latency; series: HBase, Cassandra.]
Workload D
HBase                                      Cassandra
Read operations    Read average latency    Read operations    Read average latency
47429              332.95                  47517              605.02
95066              312.76                  95039              548.78
142621             341.20                  142689             516.91
189955             335.17                  189883             476.41
284994             442.95                  285269             503.77
YCSB Workload D (Insert Evaluation)
[Figure: Insert Operations against Insert Average Latency, Workload D. X-axis: Insert Operations; Y-axis: Insert Average Latency; series: HBase, Cassandra.]
Workload D
HBase                                          Cassandra
Insert operations    Insert average latency    Insert operations    Insert average latency
2571                 1509.59                   2483                 831.84
4932                 1302.26                   4961                 722.43
7379                 1333.54                   7311                 690.06
10045                1192.89                   10017                623.41
15006                1444.81                   14731                631.53
For the insert operations, the average latency for HBase falls and then rises, as
depicted in the graph, whereas for Cassandra it decreases and then stays at an
approximately constant value.
Records VS Throughput Workload A
[Figure: Records against throughput, Workload A. X-axis: Records; Y-axis: Throughput; series: HBase, Cassandra.]
Workload A
HBase                     Cassandra
Records     Throughput    Records     Throughput
50000       1464.30       50000       1363.42
100000      1617.39       100000      1735.36
150000      895.23        150000      1719.52
200000      1082.96       200000      1841.37
300000      1429.33       300000      1813.50
As the graph shows, the throughput changes as the number of records increases.
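One way to put a number on how much the throughput varies is the coefficient of variation (sample standard deviation divided by the mean). The short Python sketch below applies it to the Workload A throughput values from the table above. The choice of this statistic is an assumption made for illustration and is not part of the original test plan; a lower value indicates steadier throughput.

```python
from statistics import mean, stdev

# Throughput values copied from the Workload A table above,
# for 50000, 100000, 150000, 200000 and 300000 records.
HBASE_A = [1464.30, 1617.39, 895.23, 1082.96, 1429.33]
CASSANDRA_A = [1363.42, 1735.36, 1719.52, 1841.37, 1813.50]

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless spread)."""
    return stdev(values) / mean(values)

if __name__ == "__main__":
    for name, series in (("HBase", HBASE_A), ("Cassandra", CASSANDRA_A)):
        print(f"{name}: mean={mean(series):.2f} "
              f"cv={coefficient_of_variation(series):.3f}")
```

With only five data points per database the figure is indicative at best, but it gives the comparison of stability a concrete basis.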
Records VS Throughput Workload D
[Figure: Records against throughput, Workload D. X-axis: Records; Y-axis: Throughput; series: HBase, Cassandra.]
Workload D
HBase                     Cassandra
Records     Throughput    Records     Throughput
50000       2074.51       50000       1436.30
100000      2554.66       100000      1665.72
150000      2324.68       150000      1780.98
200000      2456.54       200000      1961.93
300000      1952.34       300000      1901.99
Conclusion:
Both databases have their own capabilities and are used for storing and accessing data.
Each has its own advantages and disadvantages and is efficient in its own field, but
from the results above it looks like Cassandra is more efficient than HBase in this
test setup. Cassandra stayed nearly constant across the operations, without being much
affected in latency by the number of records. On the basis of these results, we can say
that Cassandra is more stable than HBase.
Contenu connexe

Tendances

Cassandra an overview
Cassandra an overviewCassandra an overview
Cassandra an overviewPritamKathar
 
NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introductionPooyan Mehrparvar
 
DSM - Comparison of Hbase and Cassandra
DSM - Comparison of Hbase and CassandraDSM - Comparison of Hbase and Cassandra
DSM - Comparison of Hbase and CassandraShrikant Samarth
 
NOSQL Database: Apache Cassandra
NOSQL Database: Apache CassandraNOSQL Database: Apache Cassandra
NOSQL Database: Apache CassandraFolio3 Software
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.Navdeep Charan
 
Altoros using no sql databases for interactive_applications
Altoros using no sql databases for interactive_applicationsAltoros using no sql databases for interactive_applications
Altoros using no sql databases for interactive_applicationsJeff Harris
 
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRA
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRAA NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRA
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRAijfcstjournal
 
Evaluating Apache Cassandra as a Cloud Database
Evaluating Apache Cassandra as a Cloud DatabaseEvaluating Apache Cassandra as a Cloud Database
Evaluating Apache Cassandra as a Cloud DatabaseDataStax
 
Cassandra basics 2.0
Cassandra basics 2.0Cassandra basics 2.0
Cassandra basics 2.0Asis Mohanty
 
HBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designHBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designphanleson
 
Learning Cassandra NoSQL
Learning Cassandra NoSQLLearning Cassandra NoSQL
Learning Cassandra NoSQLPankaj Khattar
 
1. introduction to no sql
1. introduction to no sql1. introduction to no sql
1. introduction to no sqlAnuja Gunale
 
Cassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + DynamoCassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + Dynamojbellis
 

Tendances (20)

Cassandra an overview
Cassandra an overviewCassandra an overview
Cassandra an overview
 
NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introduction
 
DSM - Comparison of Hbase and Cassandra
DSM - Comparison of Hbase and CassandraDSM - Comparison of Hbase and Cassandra
DSM - Comparison of Hbase and Cassandra
 
Cassandra Architecture FTW
Cassandra Architecture FTWCassandra Architecture FTW
Cassandra Architecture FTW
 
NOSQL Database: Apache Cassandra
NOSQL Database: Apache CassandraNOSQL Database: Apache Cassandra
NOSQL Database: Apache Cassandra
 
Nosql
NosqlNosql
Nosql
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.
 
Altoros using no sql databases for interactive_applications
Altoros using no sql databases for interactive_applicationsAltoros using no sql databases for interactive_applications
Altoros using no sql databases for interactive_applications
 
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRA
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRAA NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRA
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRA
 
Evaluating Apache Cassandra as a Cloud Database
Evaluating Apache Cassandra as a Cloud DatabaseEvaluating Apache Cassandra as a Cloud Database
Evaluating Apache Cassandra as a Cloud Database
 
Cassandra basics 2.0
Cassandra basics 2.0Cassandra basics 2.0
Cassandra basics 2.0
 
HBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designHBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table design
 
Learning Cassandra NoSQL
Learning Cassandra NoSQLLearning Cassandra NoSQL
Learning Cassandra NoSQL
 
Apache Cassandra
Apache CassandraApache Cassandra
Apache Cassandra
 
No sql
No sqlNo sql
No sql
 
4. hbase overview
4. hbase overview4. hbase overview
4. hbase overview
 
1. introduction to no sql
1. introduction to no sql1. introduction to no sql
1. introduction to no sql
 
Cassandra training
Cassandra trainingCassandra training
Cassandra training
 
Cassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + DynamoCassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + Dynamo
 
Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
 

Similaire à Data Storage Management

Data Storage and Management project Report
Data Storage and Management project ReportData Storage and Management project Report
Data Storage and Management project ReportTushar Dalvi
 
04-Introduction-to-CassandraDB-.pdf
04-Introduction-to-CassandraDB-.pdf04-Introduction-to-CassandraDB-.pdf
04-Introduction-to-CassandraDB-.pdfhothyfa
 
5266732.ppt
5266732.ppt5266732.ppt
5266732.ppthothyfa
 
Benchmarking Scalability and Elasticity of DistributedDataba.docx
Benchmarking Scalability and Elasticity of DistributedDataba.docxBenchmarking Scalability and Elasticity of DistributedDataba.docx
Benchmarking Scalability and Elasticity of DistributedDataba.docxjasoninnes20
 
Performance Analysis of HBASE and MONGODB
Performance Analysis of HBASE and MONGODBPerformance Analysis of HBASE and MONGODB
Performance Analysis of HBASE and MONGODBKaushik Rajan
 
White paper on cassandra
White paper on cassandraWhite paper on cassandra
White paper on cassandraNavanit Katiyar
 
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDBBenchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDBAthiq Ahamed
 
2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptx2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptxRushikeshChikane2
 
Cassandra implementation for collecting data and presenting data
Cassandra implementation for collecting data and presenting dataCassandra implementation for collecting data and presenting data
Cassandra implementation for collecting data and presenting dataChen Robert
 
Oracle NoSQL Database Compared to Cassandra and HBase
Oracle NoSQL Database Compared to Cassandra and HBaseOracle NoSQL Database Compared to Cassandra and HBase
Oracle NoSQL Database Compared to Cassandra and HBasePaulo Fagundes
 
Column db dol
Column db dolColumn db dol
Column db dolpoojabi
 
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRA
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRAA NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRA
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRAijfcstjournal
 

Similaire à Data Storage Management (20)

Data Storage and Management project Report
Data Storage and Management project ReportData Storage and Management project Report
Data Storage and Management project Report
 
Cassandra tutorial
Cassandra tutorialCassandra tutorial
Cassandra tutorial
 
04-Introduction-to-CassandraDB-.pdf
04-Introduction-to-CassandraDB-.pdf04-Introduction-to-CassandraDB-.pdf
04-Introduction-to-CassandraDB-.pdf
 
5266732.ppt
5266732.ppt5266732.ppt
5266732.ppt
 
Benchmarking Scalability and Elasticity of DistributedDataba.docx
Benchmarking Scalability and Elasticity of DistributedDataba.docxBenchmarking Scalability and Elasticity of DistributedDataba.docx
Benchmarking Scalability and Elasticity of DistributedDataba.docx
 
Why Cassandra?
Why Cassandra?Why Cassandra?
Why Cassandra?
 
Cassandra Learning
Cassandra LearningCassandra Learning
Cassandra Learning
 
Hbase 20141003
Hbase 20141003Hbase 20141003
Hbase 20141003
 
Performance Analysis of HBASE and MONGODB
Performance Analysis of HBASE and MONGODBPerformance Analysis of HBASE and MONGODB
Performance Analysis of HBASE and MONGODB
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
White paper on cassandra
White paper on cassandraWhite paper on cassandra
White paper on cassandra
 
Cassndra (4).pptx
Cassndra (4).pptxCassndra (4).pptx
Cassndra (4).pptx
 
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDBBenchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
 
2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptx2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptx
 
Hbase
HbaseHbase
Hbase
 
Cassandra implementation for collecting data and presenting data
Cassandra implementation for collecting data and presenting dataCassandra implementation for collecting data and presenting data
Cassandra implementation for collecting data and presenting data
 
Oracle NoSQL Database Compared to Cassandra and HBase
Oracle NoSQL Database Compared to Cassandra and HBaseOracle NoSQL Database Compared to Cassandra and HBase
Oracle NoSQL Database Compared to Cassandra and HBase
 
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
 
Column db dol
Column db dolColumn db dol
Column db dol
 
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRA
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRAA NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRA
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRA
 

Dernier

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 

Dernier (20)

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...

Data Storage Management

  • 1. Nisheet Mahajan Data Storage and Management NATIONAL COLLEGE OF IRELAND
  • 2. Comparison Between HBase and Cassandra using YCSB Nisheet Mahajan X16133099 Data Storage and Management
  • 3. INTRODUCTION Users now generate data on a massive scale, which calls for database management systems that give fast, simple access to billions of records. Big data is characterized by three V's (its dimensions): volume, velocity and variety. Volume refers to the amount of data an organization has stored, velocity to the speed at which data is generated and analyzed, and variety to the range of data types that arise in the real world. Volume and variety in particular cause many problems for organizations that rely on a relational database management system (RDBMS). New storage and management systems have therefore been introduced, including Cassandra, Voldemort, HBase and others, collectively referred to as NoSQL databases. They can index and store massive data sets in response to user requests, and they trade off consistency for other properties that are more useful at this scale. The basic concern for any data system is high performance. NoSQL databases perform better than an RDBMS in various cases, and researchers have been looking for a database that is fully optimized for their use cases. To evaluate the performance of cloud databases on a common set of workloads, Yahoo presented a framework named YCSB (Yahoo! Cloud Serving Benchmark). YCSB is an extensible workload generator: it executes a set of Core workloads, and new workloads can be defined to test other aspects of a system. It is commonly used to compare and benchmark multiple systems. The two NoSQL databases considered in this work are HBase and Cassandra. Both provide durability by logging all
  • 4. [write] operations to a file. HBase and Cassandra both use replication to prevent the loss of data caused by cluster node failures.
      Characteristics of HBase: It is a sparse, distributed, consistent sorted map. HBase indexes data by row key, column family and timestamp. Data is stored in tables, which organize it into rows, and each row has a unique row key for identification. HBase is linearly scalable with automatic failover support. It provides consistent reads and writes, offers an easy Java API for clients, and replicates data across clusters. The following are the basic components of HBase:
      Column Families: These are defined when the table schema is created, because they are not easy to modify afterwards.
      Column Qualifier: A column family can contain a very large number of column qualifiers, each of which is treated as a byte array.
      Characteristics of Cassandra: It is an open source NoSQL database offering operational simplicity, linear scale performance, and easy data distribution with continuous availability. Cassandra can run across different machines with no single point of failure. It has a peer-to-peer architecture, which avoids master-slave issues.
      ✓ Linear scale performance: Adding nodes produces a corresponding increase in performance.
      ✓ Continuous availability: Redundancy of both data and node function gives constant uptime.
      ✓ Transparent fault detection and recovery: A failed node can be restored or replaced.
      ✓ Flexible and dynamic data model: The data types support fast writes and reads.
      ✓ Strong data protection: A commit log design ensures no data loss, and built-in security with backup/restore keeps data protected and safe.
      ✓ Tunable data consistency
      ✓ Multi-data center replication
  • 5. ✓ Column families: A column family holds the columns that an application defines. A column family is either static (its column names are defined in advance in the schema) or dynamic (its column names are supplied by the application when data is written). There are several types of column family: standard (a single primary key), composite (a primary key made up of multiple columns), expiring (data is deleted after a set time) and counter (keeps track of the number of occurrences of an event).
      ✓ CQL: The Cassandra Query Language is the primary and default interface to Cassandra, and it simplifies data modelling.
      Cassandra is designed to handle workloads across many nodes with no single point of failure. The nodes in a cluster are independent and interconnected. Each node can accept read and write requests, irrespective of where the data is located in the cluster. This makes Cassandra a reliable choice for both business and technical stakeholders. Validator (the data type of column values) and comparator (the data type of column names) are two data types that are specified when a column family is created.
      Database Architecture
      HBase: The architecture of HBase consists of tables that are divided into regions, which are served by region servers. Each region is divided by column family into Stores, which are saved as files in HDFS.
  • 6. The client library, the master server and the region servers are the major components of HBase.
      HBase Master: The master server monitors all RegionServers. It assigns regions to RegionServers and performs load balancing (distributing the shards) across them. HBase can run multiple HBase Masters in a cluster, with only one active at a time. Metadata changes go through the Master.
      RegionServer: A machine running a RegionServer is considered a worker node, and the RegionServer is the implementation of the worker role. It is responsible for splitting and compacting regions, and it runs on a DataNode. Multiple RegionServers run in a cluster.
      ZooKeeper: ZooKeeper lets distributed processes coordinate with each other through a shared hierarchical namespace. In HBase, ZooKeeper is responsible for:
      ✓ Providing the availability status of RegionServers
      ✓ Ensuring a single active HMaster in the cluster
      ✓ Providing the location of the "-ROOT-" table
      ✓ Selecting a new HMaster if the active HMaster fails
      HBase gives the client the flexibility to connect to any node in the cluster. To coordinate between clients and the master node, HBase relies on ZooKeeper.
      Cassandra: Cassandra has a node-based, fault-tolerant, scalable and consistent architecture. The node is the lowest level of a Cassandra cluster, and each node represents a single Cassandra instance.
  • 7. Data centers, racks and nodes together make up the Cassandra architecture. It is a shared-nothing environment with no central controller.
      Data Partitioning: Cassandra is a partitioned distributed database that divides data evenly across the nodes of its cluster.
      Data Replication: Multiple nodes in a cluster act as replicas for a given piece of data. If some replicas respond with an out-of-date value, Cassandra returns the most recent value to the client.
      Keyspaces: A keyspace is the container that stores the column families and their data.
      Node & data center: A node is the place where data is stored, and a data center is a collection of related nodes.
      Cluster: A cluster contains one or more data centers.
      Mem-table: A memory-resident data structure to which writes are applied. A single column family can have multiple mem-tables.
  • 8. Commit Log: The crash-recovery mechanism in Cassandra.
      SSTable: A disk file to which data is flushed from the mem-table when its contents reach a threshold value.
      Bloom filter: A fast, probabilistic algorithm for testing whether an element is a member of a set. It can report false positives but never false negatives. Bloom filters act as a kind of cache and are accessed on every query.
      Comparison between HBase and Cassandra in terms of Scalability, Availability and Reliability: The analysis is based on the CAP theorem:
      • Scalability: Cassandra and HBase are both scalable databases. Because only rows and column families are described in advance, it is easy to add new columns on the fly. Cassandra achieves linear scalability by adding nodes to the cluster, and the system is designed so that the cluster makes use of the newly added resources.
      • Availability: Cassandra is the better choice in terms of availability. It has a distributed, node-based architecture in which data is replicated across nodes, so if any node fails, Cassandra can still generate a response from another replica, which makes it highly available.
      • Reliability: HBase has features such as Hadoop support and range-based row scans, which make it more reliable than Cassandra. HBase satisfies the consistency and partition tolerance properties of the CAP theorem, which means it is strongly consistent.
  • 9. Performance Test Plan
      Physical Machine
        Processor: 2.40 GHz Intel Core i5 (64-bit)
        Number of Cores: 2
        Memory: 8 GB
        Operating system: Microsoft Windows 7 Professional
        Virtualization Software: VirtualBox 5.1.12
      HBase Virtual Machine
        Operating system: Ubuntu (64-bit)
        Memory: 4 GB
        Processors: 1
      Cassandra Virtual Machine
        Operating system: Ubuntu (64-bit)
        Memory: 4 GB
        Processors: 1
      Benchmarking Application: Yahoo! Cloud Serving Benchmark, 0.11.0
      Evaluation and Results: The tests are run against the HBase and Cassandra databases using two YCSB workloads, Workload A and Workload D. Each test was run three times, and the reported figures are the average of the three sets of results.
  • 10. YCSB Workload A (Read Evaluation)
      [Figure: Read operations against read average latency, HBase vs. Cassandra]
      For read operations, the average latency of HBase drops slightly between 40,000 and 60,000 read operations and then rises between 60,000 and 80,000, as the graph shows.
      Workload A
      HBase                                 Cassandra
      Read operations   Average latency     Read operations   Average latency
      25135             491.02              25033             685.64
      50030             437.71              50142             562.19
      75030             812.25              74938             583.13
      100204            688.81              99511             547.56
      149929            515.68              149704            568.72
  • 11. YCSB Workload A (Update Evaluation)
      [Figure: Update operations against update average latency, HBase vs. Cassandra]
      Workload A
      HBase                                 Cassandra
      Update operations Average latency     Update operations Average latency
      24865             787.25              24967             620.71
      49970             741.70              49858             505.14
      74970             1318.70             75062             519.04
      99796             1102.26             100489            486.13
      150071            849.29              150296            501.23
  • 12. For update operations, the average latency of HBase rises sharply between 60,000 and 80,000 operations, as the graph shows. For Cassandra, the average latency falls between 40,000 and 60,000 operations.
      YCSB Workload D (Read Evaluation)
      [Figure: Read operations against read average latency, HBase vs. Cassandra]
      In Workload D, the graph shows that the average read latency of HBase increases, most visibly at the largest run, whereas for Cassandra it generally decreases.
      Workload D
      HBase                                 Cassandra
      Read operations   Average latency     Read operations   Average latency
      47429             332.95              47517             605.02
      95066             312.76              95039             548.78
      142621            341.20              142689            516.91
      189955            335.17              189883            476.41
      284994            442.95              285269            503.77
  • 13. YCSB Workload D (Insert Evaluation)
      [Figure: Insert operations against insert average latency, HBase vs. Cassandra]
      Workload D
      HBase                                 Cassandra
      Insert operations Average latency     Insert operations Average latency
      2571              1509.59             2483              831.84
      4932              1302.26             4961              722.43
      7379              1333.54             7311              690.06
      10045             1192.89             10017             623.41
      15006             1444.81             14731             631.53
  • 14. For insert operations, the average latency of HBase falls and then rises again, as the graph depicts, whereas for Cassandra it decreases and then settles at an approximately constant value.
      Records VS Throughput Workload A
      [Figure: Records against throughput, Workload A, HBase vs. Cassandra]
      Workload A
      HBase                       Cassandra
      Records     Throughput      Records     Throughput
      50000       1464.30         50000       1363.42
      100000      1617.39         100000      1735.36
      150000      895.23          150000      1719.52
      200000      1082.96         200000      1841.37
      300000      1429.33         300000      1813.50
  • 15. As the graph shows, the number of records affects throughput: under Workload A, HBase throughput drops sharply at 150,000 records before recovering, while Cassandra throughput rises and then levels off.
      Records VS Throughput Workload D
      [Figure: Records against throughput, Workload D, HBase vs. Cassandra]
      Workload D
      HBase                       Cassandra
      Records     Throughput      Records     Throughput
      50000       2074.51         50000       1436.30
      100000      2554.66         100000      1665.72
      150000      2324.68         150000      1780.98
      200000      2456.54         200000      1961.93
      300000      1952.34         300000      1901.99
      Conclusion: Both databases have their own capabilities for storing and accessing data. Each has its own advantages and disadvantages and is efficient in its own field, but the results above suggest that Cassandra is much more efficient than
  • 16. HBase. Cassandra's performance stayed nearly constant across the operations tested, and its latency was not much affected by the number of operations or records. On the basis of these results, Cassandra is much more stable than HBase.
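The Bloom filter described in the Cassandra architecture section can be illustrated with a short Python sketch. This is a minimal teaching example, not Cassandra's implementation: the class name `BloomFilter`, the default sizes and the use of salted SHA-256 as the hash family are all assumptions made here for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, kept simple for clarity

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for row_key in ("user1", "user2", "user3"):
    bloom.add(row_key)

print(bloom.might_contain("user2"))    # added keys are always reported as present
print(bloom.might_contain("user999"))  # usually False; True would be a false positive
```

A negative answer lets a store skip a disk file entirely, which is why such a filter is cheap to consult on every query.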
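One way to examine the stability claim in the conclusion is to compute the mean and the spread of the average latencies reported in the Workload A and Workload D tables. The sketch below copies those figures from the tables. Using the population standard deviation as a measure of "stability" is an assumption of this sketch, not part of the original analysis, and the helper name `summarize` is hypothetical.

```python
from statistics import mean, pstdev

# Average latencies copied from the Workload A and Workload D tables.
latency = {
    "Workload A read": {
        "HBase": [491.02, 437.71, 812.25, 688.81, 515.68],
        "Cassandra": [685.64, 562.19, 583.13, 547.56, 568.72],
    },
    "Workload A update": {
        "HBase": [787.25, 741.70, 1318.70, 1102.26, 849.29],
        "Cassandra": [620.71, 505.14, 519.04, 486.13, 501.23],
    },
    "Workload D read": {
        "HBase": [332.95, 312.76, 341.20, 335.17, 442.95],
        "Cassandra": [605.02, 548.78, 516.91, 476.41, 503.77],
    },
    "Workload D insert": {
        "HBase": [1509.59, 1302.26, 1333.54, 1192.89, 1444.81],
        "Cassandra": [831.84, 722.43, 690.06, 623.41, 631.53],
    },
}


def summarize(values):
    """Return (mean, population standard deviation), rounded to two decimals."""
    return round(mean(values), 2), round(pstdev(values), 2)


for test, per_db in latency.items():
    for db, values in per_db.items():
        avg, spread = summarize(values)
        print(f"{test:18} {db:10} mean={avg:8.2f} spread={spread:7.2f}")
```

A smaller spread indicates latency that varies less across run sizes, while the mean shows which database was faster on that operation; the two do not always favor the same system.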