Data Storage and Management project Report

Comparison of HBase and Cassandra, The Two
NoSQL Databases
Tushar Shailesh Dalvi
x18134301 / B
22/04/2019
Abstract
Today we are immersed in digital information, yet we are poor at managing and
processing it. It is becoming increasingly hard to store and analyse data efficiently
and economically with traditional database management tools. Moreover, the kinds of
data arriving in databases are also changing. Researchers all over the globe are
occupied with the analysis of ultra-large databases. Relational databases such as
MSSQL offer data warehousing capabilities that break databases down from the user's
point of view, but they still do not reach maximum performance, so retrieving output
and performing other work takes time; non-relational databases, by contrast, perform
better than relational ones. A number of non-relational database systems are
available in the market, such as Bigdata, MongoDB and Hive. This paper aims at
evaluating the performance of random reads and random writes on HBase and Cassandra
and compares the results obtained through various operations on Ubuntu.
Keywords: Big Data, HBase, Cassandra, YCSB, Workload, Ubuntu, NoSQL, Hadoop,
RDBMS
1 Introduction
In the last few decades, we have witnessed a great revolution in technology. The
traditional technique of managing structured data includes a relational database and a
schema to manage the storage and retrieval of the dataset. For managing large datasets
in a structured fashion, the primary approaches are data warehouses and data marts. A
data warehouse is a relational database system used for storing, analyzing, and
reporting functions. The data mart is the layer used to access the data warehouse. The
data stored in the warehouse is sourced from the operational systems Bakshi (2012).
During the same period, we have been engulfed by the rise in the volume of electronic
and digital data. Half a decade ago, the average size of corporate data tended to be in
the range of gigabytes (GB). Now multi-terabyte (TB) or even petabyte (PB) sizes are
common for large corporate databases. Until a few years ago, companies managed this
amount of data with relational database management systems (RDBMS). In recent years,
however, the internet and social media have been generating massive data that
traditional database systems cannot handle, because of high latency in query
processing, low data transfer rates, and limited horizontal scalability. To overcome
this, a new kind of database system was introduced: NoSQL. NoSQL stands for "Not only
SQL" and is based on a node-to-node architecture. NoSQL does not require any kind of
fixed schema. The main motivations for this approach are simpler database design,
horizontal scaling across clusters of machines, and finer control over availability
McCreary & Kelly (2014). The structure of NoSQL databases is entirely different from
that of relational databases, which can make operations faster.
There are four types of NoSQL database:
• Graph stores: store data in graph form, such as social connections or network
data, for example Neo4J.
• Document stores: store documents made up of tagged elements, for example
CouchDB.
• Key-value stores: large hash tables that contain keys and values, for example
Amazon S3.
• Column-based stores: optimized for queries over large datasets; they store
columns of data together, instead of rows.
Recently, NoSQL databases have gained many notable features, giving users the option
to pick a database according to their application requirements. HBase is a specific
implementation of NoSQL within the Hadoop project. HBase is a distributed
column-oriented database built on top of HDFS (Hadoop Distributed File System), which
is discussed in a later section. HBase is not a relational database and does not
support SQL. However, it can host very large, sparsely populated tables on clusters
made of commodity hardware. Data is stored in rows, and the columns of a row are
grouped into column families. The column families of a table must be specified as part
of the table schema definition. In HBase, tables are partitioned horizontally into
regions, which are the units that get distributed over the HBase cluster. The HBase
infrastructure design is also based on a distributed master-slave architecture. The
HBase master node orchestrates a cluster of one or more slaves. HBase also depends on
a coordination service. HBase persists all its data via the Hadoop file system API
Bakshi (2012).
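The data model described above can be illustrated with a minimal Python sketch. It is not HBase code: the table contents, the split points in `region_start_keys` and the helper names `region_for` and `get_cell` are all hypothetical, chosen only to show a sparse table whose columns are grouped into families and whose rows are partitioned into regions by sorted row key.

```python
from bisect import bisect_right

# Hypothetical sparse table: row key -> {"family:qualifier": value}.
# Only the cells that exist are stored, so rows may have different columns.
table = {
    "user#001": {"info:name": "Ada", "stats:logins": 3},
    "user#042": {"info:name": "Lin"},
    "user#107": {"stats:logins": 9},
}

# Regions cover contiguous, sorted row-key ranges; each start key opens a region.
region_start_keys = ["", "user#040", "user#100"]  # assumed split points


def region_for(row_key: str) -> int:
    """Return the index of the region whose key range contains row_key."""
    return bisect_right(region_start_keys, row_key) - 1


def get_cell(row_key: str, column: str):
    """Look up one cell; a missing cell in a sparse table simply yields None."""
    return table.get(row_key, {}).get(column)
```

For example, `region_for("user#042")` selects the second region (index 1), and `get_cell("user#042", "stats:logins")` returns `None` because that cell was never written.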
2 Key Characteristics
2.1 HBase
As a NoSQL database, HBase offers plenty of useful functionality. The following are
some of the key features of HBase:
• HBase is designed for large databases. It can process petabytes of data in a
distributed environment.
• HBase runs on the HDFS hardware environment, and HDFS needs a large number
of nodes (minimum 5).
• HBase provides quick access to the database; if we need random, real-time
access to data, HBase is a good choice.
• HBase provides consistent reads and writes, which suits workloads with high-speed
read and write requirements.
• To reduce I/O time and overhead, HBase offers automatic and manual splitting
of regions into smaller subregions once a threshold is reached.
• HBase provides data replication across clusters.
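The region splitting mentioned in the list above can be sketched as follows. This is only an illustration under simplifying assumptions: the hypothetical `split_region` function uses a row count as the threshold and splits at the middle key, whereas a real cluster decides on splits from its own size-based configuration.

```python
def split_region(rows: list[str], max_rows: int) -> list[list[str]]:
    """Split a sorted list of row keys into two halves once it exceeds max_rows."""
    if len(rows) <= max_rows:
        return [rows]                      # below the threshold: keep one region
    mid = len(rows) // 2                   # assumed split point: the middle key
    return [rows[:mid], rows[mid:]]        # two smaller, still sorted, regions
```

Calling `split_region(["a", "b", "c", "d"], max_rows=3)` yields two regions, `["a", "b"]` and `["c", "d"]`, each of which can then be served by a different region server.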
2.2 Cassandra
Cassandra is a robust and full-featured NoSQL database that has been deployed by
some of the biggest companies, such as Facebook and Google. The following are some key
features of Cassandra:
• Cassandra distributes data efficiently over multiple data centers through a simple
process of data replication.
• It writes large amounts of data without affecting read efficiency.
• Cassandra supports all kinds of structured, unstructured and semi-structured datasets.
• Cassandra implements a Dynamo-style replication model with no single point of
failure, but adds a more powerful column family data model.
• Cassandra is linearly scalable, which means throughput increases as the number of
nodes in the cluster increases. Therefore, it maintains a quick response time.
• It is an open source database, like HBase and MongoDB.
3 Database Architectures
3.1 HBase Architecture:
On top of HDFS, HBase provides low-latency random reads and writes. In HBase, tables
are dynamically distributed by the system whenever they become too large to handle. To
achieve this, HBase scales horizontally using regions. A region is a continuous, sorted
range of rows that are stored together. The HBase architecture has several slaves, i.e.
region servers, and a single HBase master node (HMaster). Whenever a client sends a
write request, HMaster receives the request and forwards it to the corresponding
region server. HBase can be run in a multiple-master setup, but only a single master
can be active at a time Overview of HBase Architecture and its Components (2016).
Below are the main components of the HBase architecture:
HMaster: HMaster manages and monitors the Hadoop cluster. HMaster is a lightweight
process that decides which part of the data is stored in which region, for load
balancing. It also applies metadata changes requested by clients. DDL operations are
handled by HMaster: it administers creating, updating and deleting tables.
Region Server: A region server runs on an HDFS data node and handles read, write
and delete operations. It runs on every node of the Hadoop cluster. A region server has
the following components:
Figure 1: HBase Architecture
• 1. Block Cache is the cache used for reading data. Frequently read data is stored in
this read cache, and when the block cache is full, the least recently used data is
evicted.
• 2. MemStore is the cache for writes; it stores new data that has not yet been written
to disk. There is one MemStore for every column family in a region.
• 3. WAL (Write Ahead Log) is a file that stores new data that has not yet been
persisted to permanent storage.
• 4. HFile is the actual storage file, which stores raw sorted key-value pairs on disk.
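How these four components cooperate can be sketched with a toy Python model. It is a simplification written for illustration, not HBase's implementation: the class name `RegionStore`, the count-based flush threshold and the per-key cache are assumptions, and plain in-memory lists stand in for files on disk.

```python
from collections import OrderedDict


class RegionStore:
    """Toy model of one column family's storage inside a region server."""

    def __init__(self, flush_threshold: int = 3, cache_size: int = 2):
        self.wal = []                      # write-ahead log: appended to first
        self.memstore = {}                 # in-memory writes not yet on disk
        self.hfiles = []                   # each flush yields one sorted, immutable file
        self.block_cache = OrderedDict()   # read cache with LRU eviction
        self.flush_threshold = flush_threshold
        self.cache_size = cache_size

    def put(self, key, value):
        self.wal.append((key, value))      # 1. log the write for recovery
        self.memstore[key] = value         # 2. buffer it in memory
        self.block_cache.pop(key, None)    # drop any stale cached copy
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. persist the buffered data as a sorted key-value file
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, key):
        if key in self.memstore:             # newest data first
            return self.memstore[key]
        if key in self.block_cache:          # then the read cache
            self.block_cache.move_to_end(key)
            return self.block_cache[key]
        for hfile in reversed(self.hfiles):  # then files, newest first
            for k, v in hfile:
                if k == key:
                    self._cache(key, v)
                    return v
        return None

    def _cache(self, key, value):
        self.block_cache[key] = value
        if len(self.block_cache) > self.cache_size:
            self.block_cache.popitem(last=False)  # evict the least recently used entry
```

With `flush_threshold=2`, two `put` calls fill the MemStore and trigger a flush into one sorted file, while the log keeps every write in arrival order.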
ZooKeeper: ZooKeeper provides the service discovery that keeps the HMaster and the
region servers together. The ZooKeeper service tracks all the region servers in the
HBase cluster, including the total number of region servers and which region servers
are holding which data node. It is also responsible for establishing communication
between region servers and clients, and it maintains the configuration information. In
addition, it provides ephemeral nodes representing the various region servers, which
help in tracking server failures and network partitions.
3.2 Cassandra Architecture:
Cassandra is a distributed peer-to-peer database running on a cluster of homogeneous
nodes. In the background, the nodes use the Gossip Protocol to communicate with each
other and to detect any faulty nodes in the cluster. It was architected to handle large
volumes of data while providing high availability.
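The idea behind gossip-based state exchange can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Cassandra's protocol: the functions `merge_gossip` and `suspected_down` are hypothetical, heartbeats are plain counters, and a fixed lag threshold stands in for the statistical failure detector a real cluster would use.

```python
def merge_gossip(local: dict[str, int], remote: dict[str, int]) -> dict[str, int]:
    """Merge two views of node heartbeats, keeping the highest counter per node."""
    merged = dict(local)
    for node, heartbeat in remote.items():
        if heartbeat > merged.get(node, -1):
            merged[node] = heartbeat
    return merged


def suspected_down(view: dict[str, int], current_round: int, max_lag: int) -> list[str]:
    """Nodes whose heartbeat counter has fallen more than max_lag rounds behind."""
    return sorted(n for n, hb in view.items() if current_round - hb > max_lag)
```

When two nodes exchange views, each ends up with the freshest heartbeat it has heard for every peer, so news about a node spreads through the cluster without any central coordinator.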
The key feature of the Cassandra architecture is that there is no single point of
failure. This means that if there are 100 nodes in a cluster and one node fails, the
cluster should continue to operate. The components of Cassandra are listed below
Tutorialspoint.com (n.d.):
• 1. Node is a place where data is stored.
Figure 2: Cassandra Architecture
• 2. Data center is a collection of related nodes.
• 3. Cluster is a component that holds one or more data centers.
• 4. Commit log is a crash recovery mechanism; every write operation is written to the
commit log.
• 5. Mem-table is a data structure that resides in memory. Data is written to the
mem-table after the commit log.
• 6. SSTable is a disk file to which the mem-table data is flushed when its contents
reach a threshold value.
• 7. Bloom filter is a fast, probabilistic test of whether an element is a member of
a set. It is a special type of cache that is accessed after each query.
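The Bloom filter in the last item can be illustrated with a minimal Python sketch. The class below is a generic textbook-style filter, not Cassandra's implementation; the bit-array size, the number of hash functions and the use of salted SHA-256 are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive num_hashes positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(key))
```

A negative answer lets a read skip a disk file entirely, while a positive answer still has to be confirmed against the file because false positives are possible.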
4 Comparison Between HBase and Cassandra on the Basis of
Transaction Management and Security:
SQL has ruled the market for ages, and its main functionality is ACID, which stands
for Atomicity, Consistency, Isolation and Durability.
• Transaction Management: Cassandra does not support the full set of ACID
properties; it gives you AID among them. That means Cassandra is atomic, isolated
and durable in nature. The C property of ACID, Consistency, does not apply to
Cassandra because it has no concept of referential integrity or foreign keys.
Cassandra lets you tune the consistency level to your needs: you can have either
partial or full consistency. Cassandra uses lightweight transactions, also known
as compare-and-set transactions. They can be used for both INSERT and UPDATE
statements, using the IF clause. HBase provides strong consistency for single-row
updates or batch updates within a single region. However, it is highly challenging
and error-prone to implement exactly-once semantics on HBase without consistency
guarantees for data updates across regions and tables. Apache Tephra helps to
provide ACID transactions on HBase.
• Security: Like all NoSQL databases, HBase and Cassandra have security issues;
the main one is that securing data costs performance, making the system heavy and
inflexible. Both systems nevertheless have their own security features: Cassandra
has inter-node and client-to-node encryption, and HBase provides secure
communication with other technologies. In Cassandra we can define user roles and
set conditions for those roles so that users can access only limited data, whereas
HBase assigns data sets a visibility label and then tells users which labels
they can see.
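The tunable consistency and compare-and-set behaviour discussed under Transaction Management can be made concrete with a small Python sketch. The helper names `quorum`, `overlaps` and `compare_and_set` are hypothetical and operate on plain Python values rather than on a database; they only illustrate the arithmetic (a read and a write overlap on at least one replica when R + W > RF) and the conditional-update idea.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a majority."""
    return replication_factor // 2 + 1


def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when every read set must intersect every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


def compare_and_set(row: dict, column: str, expected, new_value) -> bool:
    """Apply the update only if the current value matches the expected one."""
    if row.get(column) != expected:
        return False
    row[column] = new_value
    return True
```

With a replication factor of 3, quorum reads and quorum writes (2 replicas each) overlap, which corresponds to the "full consistency" end of the range, while reading and writing a single replica does not.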
5 Learning from literature survey
As we know, traditional database systems have remained the main data storage systems
in many large IT organizations for the past 30 to 40 years. Organizations need to store
more, and more detailed, information every year, and to keep it for longer periods.
Increased regulation is significantly increasing storage volumes in areas such as
health and finance. Because of the critical nature of the information, this data is
often stored on expensive shared storage systems. Shared storage arrays provide
features such as striping and mirroring for performance and availability, which are
core factors for traditional databases. Managing the volume and cost of this data
growth within these traditional systems is usually painful for IT organizations. On top
of that, the growing limitations of traditional databases, such as poor representation
of real-world entities, the compulsion of database normalization to reduce data
replication, and poor performance for long-duration transactions, were the main reasons
for the increasing cost of keeping databases error-free (n.d.).
Today, however, NoSQL is widely reported to outperform traditional database systems.
As data generated via the internet grows rapidly, a traditional database system cannot
handle all data transactions, and the result is compromised performance; NoSQL was
invented to overcome this problem. Because they can handle huge databases such as big
data, the various NoSQL database systems let us handle all this data easily. Adopting
cloud computing has also become a trend in the industry. The rapid development of
internet-based services and applications such as social media networks, online
shopping and web search engines is generating multi-terabyte data Wang et al. (2014).
Because of the variety of databases available in the market, we need to concentrate on
those databases that provide almost the same features. According to the study performed
by Waage & Wiese (2014), Cassandra and HBase share a couple of common features. Both
can be considered key-value stores and column family stores. They provide high
availability by distributing and replicating data, and both are implemented in the
Java programming language. To perform benchmark testing on the selected databases we
need to select a proper benchmarking tool; a number of benchmarking tools are available
in the industry, such as SandStorm, YCSB and Altoros. According to the study conducted
by Abubakar et al. (2014), IT professionals are working hard to ensure that the
databases they choose are optimised for application success. Such a selection can be
made on the basis of benchmark tests on the databases. The Yahoo! Cloud Serving
Benchmark (YCSB) framework helps to compare the performance of the new generation of
NoSQL databases where resources are limited. Another reason to choose the YCSB
framework is that the YCSB Client is a capable workload generator; we can also install
multiple systems on the same hardware configuration and run the same workloads against
each system Cooper et al. (2010). The YCSB Client is written in the Java programming
language; it generates the data to be loaded into the database and generates the
operations that make up the workload. The basic operation is that multiple client
threads are driven by the workload executor. Each thread runs a series of sequential
operations by making calls to the database interface layer for the load and run phases.
All threads also measure the latency and performance of their operations and report
these measurements to the statistics module Abramova et al. (2014).
The tools and parameters used in the above research papers motivated me to use
HBase and Cassandra as the NoSQL databases and to benchmark them using YCSB; using
different workload types helps me to establish a proper benchmarking report. To select
the workloads I referred to Abramova et al. (2014); according to that study, the
different workload types take varying amounts of time, from 10 minutes to 10 hours.
Workload A is update heavy: the operations performed are 50% read and 50% update.
Workload B is read heavy: YCSB performs 95% read and 5% update operations. In
Workload C, YCSB only reads from the database, performing 100% read operations. In
Workload D the ratio of reads is 95% and the ratio of inserts is 5%, and the reads
target the most recently inserted data. In Workload E, YCSB performs 95% scans over
short ranges of records and 5% inserts Abramova et al. (2014). Based on the previous
reports, I chose Workloads A and B, which cover combined heavy write and read activity
and heavy read activity, to check the durability of the HBase and Cassandra databases.
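The workload mixes described above can be illustrated with a short Python sketch that draws operations according to each workload's proportions. It is not part of YCSB: the `WORKLOADS` table simply restates the percentages from the text, and `generate_operations` with its `seed` parameter is a hypothetical helper.

```python
import random

# Operation proportions as described in the text for the YCSB core workloads.
WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50},
    "B": {"read": 0.95, "update": 0.05},
    "C": {"read": 1.00},
    "D": {"read": 0.95, "insert": 0.05},
    "E": {"scan": 0.95, "insert": 0.05},
}


def generate_operations(workload: str, count: int, seed: int = 0) -> list[str]:
    """Draw `count` operation names according to the workload's proportions."""
    mix = WORKLOADS[workload]
    rng = random.Random(seed)  # seeded so a run can be repeated
    return rng.choices(list(mix), weights=list(mix.values()), k=count)
```

For instance, `generate_operations("B", 10000)` returns a list in which roughly 95% of the entries are `"read"` and the rest are `"update"`.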
6 Performance Test Plan
6.1 Physical Machine
• Processor: 2.00 GHz AMD Ryzen 5 2500U
• Number of Core: 4
• Memory: 20GB
• Operating System: Windows 10 Professional
• System Type: 64-bit Operating System, x64-based processor
• Virtualization tool: Open Stack Cloud Server
6.2 HBase Virtual Machine:
• Operating System : Ubuntu (64-bit)
• Memory: 4 GB
• Virtual Processor: 2
6.3 Cassandra Virtual Machine:
• Operating System : Ubuntu (64-bit)
• Memory: 4 GB
• Virtual Processor: 2
6.4 Benchmarking Tool:
• Yahoo! Cloud Serving benchmark 0.15.0 (YCSB-0.15.0)
6.5 Workload Parameter:
• WorkLoad A
• WorkLoad B
6.6 Visualization tool:
• Tableau
7 Evaluations and Results
We performed Workloads A and B against the HBase and Cassandra stores with the
benchmarking tool (YCSB). The workload tests were performed twice and the averaged
outputs were assessed.
• WorkLoad A: Read = 50% and Update = 50%
• WorkLoad B: Read = 95% and Update = 5%
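Averaging two runs of the same workload, as described above, amounts to the following arithmetic. The sketch is illustrative only: the `summarize` helper, the field names and the two example runs are hypothetical and are not figures from this experiment, and the latency unit (microseconds) is an assumption.

```python
from statistics import mean


def summarize(runs: list[dict]) -> dict:
    """Average throughput and latency over repeated runs of one workload."""
    return {
        "throughput_ops_sec": mean(r["operations"] / r["runtime_sec"] for r in runs),
        "avg_latency_us": mean(r["avg_latency_us"] for r in runs),
    }


# Hypothetical numbers for two runs; not taken from the experiment.
runs = [
    {"operations": 100000, "runtime_sec": 40.0, "avg_latency_us": 400.0},
    {"operations": 100000, "runtime_sec": 50.0, "avg_latency_us": 420.0},
]
```

Here the two runs reach 2500 and 2000 operations per second, so `summarize(runs)` reports an average throughput of 2250 operations per second and an average latency of 410.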

7.1 Workload A Result:
• Average Latency Vs. Throughput
Figure 3: Average Latency Vs. Throughput
The graph above compares Cassandra and HBase with respect to load throughput vs.
average load latency. For HBase, load throughput decreases consistently up to count 3
while average load latency increases. For Cassandra, in contrast, load throughput
initially increases along with average load latency up to count 3, after which
throughput decreases while average latency keeps increasing. From the graph we can say
that as the number of transactions per second decreases the time also decreases, but
compared with Cassandra, HBase is performing better.

• Read operation vs. update latency
Figure 4: Read operation vs. update latency
The figure above compares average latency with the number of read operations. For
the Cassandra database, the average latency is 403.87 at count 1 and decreases slowly
as read operations increase, reaching 360 at count 3. For HBase, latency increases from
count 2 to count 3. This shows that even as the number of read operations increases,
Cassandra gives better performance than HBase.
7.2 Workload B Result:
• Overall Insert latency Vs. Overall Throughput
Figure 5: Overall Insert latency Vs. Overall Throughput
The figure above shows Workload B for Cassandra and HBase. For the Cassandra database
system, average latency decreases even as overall throughput per second increases. For
HBase, average latency initially increases but later decreases, falling below that of
the Cassandra database by count 3. So for Workload B, HBase initially performs poorly
compared with Cassandra, but its performance improves around count 3.
• Read Operation Vs. Overall Read Throughput
Figure 6: Read Operation Vs. Overall Read Throughput
The figure above compares the number of read operations with overall read throughput.
At count 1, read throughput was more than 2600 and the number of read operations was
approximately 95000. As the read operation count increases, read throughput for the
HBase database system decreases, whereas for Cassandra overall throughput increases as
read operations increase. From this graph, we conclude that HBase is performing better
than Cassandra.

• Update Operation Vs. Update Average Latency
Figure 7: Update Operation Vs. Update Average Latency
In the figure above we can see that for HBase, average update latency decreases up to
count 2 as update operations increase, but then increases drastically up to count 3.
For Cassandra, in contrast, average latency decreases slowly as update operations
increase: it was 422.5 at count 1 and 397.1 at count 3. So in this test, average
latency increases for HBase but decreases for Cassandra.

8 Conclusion and Discussion:
To conclude, we conducted an experiment in which two very prominent NoSQL databases,
HBase and Cassandra, were compared on parameters such as consistency, scalability and
performance. The architecture of both databases was examined along with their key
characteristics. The experiment was conducted with the Yahoo! Cloud Serving Benchmark
for Workload A and Workload B. All workloads were tested twice, and the paper presents
the averaged results. Comparing the above results, we can say that for Workload A,
with 50 percent read and 50 percent update operations, the two databases performed
differently. For Workload A with record counts of 100000, 20000 and 300000, HBase
performed better than Cassandra, but in the read operation vs. update latency
comparison Cassandra provided better performance.
For Workload B, with 95 percent read and 5 percent update operations, Cassandra gave
better performance than HBase, while for read operations HBase performed better than
Cassandra. It is very hard to scale the performance of HBase with increasing workload,
whereas Cassandra gives us the expected results with significantly lower latency and
high throughput values. From the analysis we can say that Cassandra performed far
better than HBase for all the given parameters against the benchmark, whereas HBase
proved to be highly inconsistent throughout the experiment. Going forward, we will run
the tests with much higher workloads to work out how these two NoSQL databases perform
and which one is more appropriate for operations in enterprises.
References
(n.d.).
Abramova, V., Bernardino, J. & Furtado, P. (2014), Evaluating cassandra scalability
with ycsb, in ‘International Conference on Database and Expert Systems Applications’,
Springer, pp. 199–207.
Abubakar, Y., Adeyi, T. S. & Auta, I. G. (2014), ‘Performance evaluation of nosql systems
using ycsb in a resource austere environment’, Performance Evaluation 7(8), 23–27.
Bakshi, K. (2012), Considerations for big data: Architecture and approach, in ‘2012 IEEE
Aerospace Conference’, IEEE, pp. 1–7.
Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R. & Sears, R. (2010),
Benchmarking cloud serving systems with ycsb, in ‘Proceedings of the 1st ACM symposium
on Cloud computing’, ACM, pp. 143–154.
McCreary, D. & Kelly, A. (2014), ‘Making sense of nosql’, Shelter Island: Manning
pp. 19–20.
Overview of HBase Architecture and its Components (2016).
URL: https://www.dezyre.com/article/overview-of-hbase-architecture-and-its-components/295
Tutorialspoint.com (n.d.), ‘Cassandra architecture’.
URL: https://www.tutorialspoint.com/cassandra/cassandraarchitecture.htm
Waage, T. & Wiese, L. (2014), Benchmarking encrypted data storage in hbase and
cassandra with ycsb, in ‘International Symposium on Foundations and Practice of Security’,
Springer, pp. 311–325.
Wang, H., Li, J., Zhang, H. & Zhou, Y. (2014), Benchmarking replication and consistency
strategies in cloud serving databases: Hbase and cassandra, in ‘Workshop on Big Data
Benchmarks, Performance Optimization, and Emerging Hardware’, Springer, pp. 71–82.
