Comparison of HBase and Cassandra, The Two
NoSQL Databases
Tushar Shailesh Dalvi
x18134301 / B
22/04/2019
Abstract
Today we are immersed in digital data, yet we remain poor at managing and processing
it. It is becoming increasingly hard to store and analyze data efficiently and
economically with traditional database management tools. Moreover, the kinds of data
arriving in databases are also changing, and researchers around the globe are
struggling with the analysis of ultra-large databases. Relational databases such as
MSSQL offer data warehousing capabilities that partition databases from the user's
point of view, yet they still fall short of maximum performance, so retrieving output
and performing other work takes time; non-relational databases, by contrast, perform
better than relational ones. A number of non-relational database systems are available
on the market, such as BigData, MongoDB and Hive. This paper aims to evaluate the
random-read and random-write performance of HBase and Cassandra and to compare the
results obtained through various operations on Ubuntu.
Keywords: BigData, HBase, Cassandra, YCSB, Workload, Ubuntu, NoSQL, Hadoop, RDBMS
1 Introduction
In the last few decades we have witnessed a great revolution in technology. The
traditional way of managing structured data involves a relational database and a
schema to manage the storage and retrieval of the dataset. For managing large datasets
in a structured fashion, the primary approaches are data warehouses and data marts. A
data warehouse is a relational database system used for storing, analyzing, and
reporting functions. The data mart is the layer used to access the data warehouse. The
data stored in the warehouse is sourced from the operational systems Bakshi (2012).
During the same period, we have been engulfed by growth in the amount of electronic and
digital data. Half a decade ago the average size of corporate data tended to be in the
range of gigabytes (GBs). Now multi-terabyte (TB) or even petabyte (PB) sizes are
common for large corporate databases. Until a few years ago, companies managed this
amount of data with relational database management systems (RDBMS). Since then,
however, the internet and social media have been generating massive volumes of data
that traditional database systems cannot handle, because of high query-processing
latency, low data transfer rates, and limited horizontal scalability. NoSQL database
systems were introduced to overcome these limitations. NoSQL stands for "Not only SQL"
and is based on a node-to-node architecture. NoSQL does not require any kind of fixed
schema. The main motivations for this approach are simpler database design, horizontal
scaling across clusters of machines, and finer control over availability McCreary &
Kelly (2014). The structure of NoSQL databases is entirely different from that of
relational databases, which makes operations faster.
There are four types of NoSQL database:
• Graph stores: store data in graph form, such as social connections or network data;
for example, Neo4J.
• Document stores: store documents made up of tagged elements; for example, CouchDB.
• Key-value stores: large hash tables that contain keys and values, such as Amazon S3.
• Column-based stores: optimized for queries over large datasets; they store columns
of data together instead of rows.
NoSQL databases have recently gained numerous notable features, giving users the
option to pick a database according to their application requirements. HBase is a
specific implementation of NoSQL within the Hadoop project. HBase is a distributed
column-oriented database built on top of HDFS (Hadoop Distributed File System),
discussed in a later section. HBase is not a relational database and does not support
SQL. However, it can host very large, sparsely populated tables on clusters made of
commodity hardware. Data is stored in rows and columns, and column families group
columns together. The table's column families must be specified as part of the table
schema definition. In HBase, tables are partitioned horizontally into regions, which
are the units that get distributed over the HBase cluster. The HBase infrastructure is
also based on a distributed master-slave architecture. The HBase master node
orchestrates a cluster of one or more slaves. HBase also depends on a coordination
service. HBase maintains all its data via Hadoop file system APIs Bakshi (2012).
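The data model described above (a sparse table whose column families are fixed in the schema, split horizontally into regions by row key) can be illustrated with a small in-memory sketch. This is a toy model for intuition only, not the HBase API; the class name `ToyTable`, its methods, and the split keys are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


class ToyTable:
    """Sparse table: row key -> column family -> qualifier -> value."""

    def __init__(self, column_families, split_keys):
        # Column families are fixed when the table is defined.
        self.column_families = set(column_families)
        # Sorted split keys divide the row-key space into regions.
        self.split_keys = sorted(split_keys)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def region_for(self, row_key):
        # Region i holds the row keys in [split_keys[i-1], split_keys[i]).
        return bisect_right(self.split_keys, row_key)

    def put(self, row_key, family, qualifier, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Only the cells that are written take up space (sparse storage).
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family, qualifier):
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)


table = ToyTable(column_families=["info", "stats"], split_keys=["g", "p"])
table.put("alice", "info", "email", "alice@example.com")
table.put("zoe", "stats", "visits", 3)
print(table.region_for("alice"), table.region_for("zoe"))
```

With two split keys the table has three regions, and each row key maps to exactly one of them, which is what allows regions to be placed on different servers.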
2 Key Characteristics
2.1 HBase
As a NoSQL database, HBase offers many useful capabilities. The following are some of
its key features:
• HBase is designed for large databases. It can process petabytes of data in a
distributed environment.
• HBase runs on HDFS, which requires a large number of nodes (minimum 5).
• HBase provides fast access to data; when random, real-time access to data is needed,
HBase is a good choice.
• HBase provides consistency for workloads that require high-speed reads and writes.
• To reduce I/O time and overhead, HBase offers automatic and manual splitting of
regions into smaller subregions once a threshold is reached.
• HBase provides data replication across clusters.
2.2 Cassandra
Cassandra is a robust and full-featured NoSQL database that has been deployed by some
of the biggest companies, such as Facebook and Google. The following are some key
features of Cassandra:
• Cassandra distributes data efficiently across multiple data centers through a simple
process of data replication.
• It sustains a high volume of writes without affecting read efficiency.
• Cassandra supports structured, unstructured, and semi-structured datasets.
• Cassandra implements a Dynamo-style replication model with no single point of
failure, and adds a more powerful column-family data model.
• Cassandra is linearly scalable: throughput increases as the number of nodes in the
cluster increases, so it maintains a quick response time.
• It is an open-source database, like HBase and MongoDB.
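To make the Dynamo-style replication feature more concrete, the sketch below places each key on a hash ring and copies it to the next few nodes clockwise, so no single node is special. It is a simplified illustration under stated assumptions, not Cassandra's actual partitioner or replication strategy; `token`, `ToyRing`, and the node names are hypothetical.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    # Hypothetical token function: first 8 bytes of an MD5 digest as an integer.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class ToyRing:
    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        # Each node owns one position on the ring, sorted by token.
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, key: str):
        # Walk clockwise from the key's token and take the next N nodes.
        start = bisect_right(self.tokens, token(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replication_factor)]


ring = ToyRing(["node1", "node2", "node3", "node4"], replication_factor=3)
print(ring.replicas("user42"))
```

Because every key is held by several peers, the loss of one node leaves other replicas available, and adding nodes spreads the key space over more machines.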
3 Database Architectures
3.1 HBase Architecture:
On top of HDFS, HBase provides low-latency random reads and writes. HBase tables are
distributed dynamically because they can become too large to handle on a single
machine. To achieve this, HBase scales horizontally using regions. A region is a
contiguous, sorted group of rows that are stored together. The HBase architecture has
several slaves, i.e. region servers, and a single HBase master node (HMaster). Whenever
a client sends a write request, HMaster receives the request and forwards it to the
corresponding region server. HBase can be run in a multiple-master setup, but only a
single master can be active at a time Overview of HBase Architecture and its Components
(2016). Below are the main components of the HBase architecture:
HMaster: HMaster manages and monitors the Hadoop cluster. It is a lightweight process
that decides which part of the data is stored in which region, for load balancing. It
also applies metadata changes requested by clients. DDL operations are handled by
HMaster: it administers creating, updating, and deleting tables.
Region Server: A region server runs on an HDFS data node and handles read, write, and
delete operations. It runs on every node of the Hadoop cluster. A region server has the
following components:
Figure 1: HBase Architecture
• 1. Block Cache is the cache used for reading data. Frequently read data is kept in
this read cache, and when the block cache is full, the least recently used data is
evicted.
• 2. MemStore is the write cache, which stores new data that has not yet been written
to disk. There is one MemStore per column family in each region.
• 3. WAL (Write Ahead Log) is a file that stores new data that has not yet been
persisted to permanent storage.
• 4. HFile is the actual storage file, which stores the raw, sorted key-value pairs on
disk.
ZooKeeper: ZooKeeper acts as a service-discovery layer that keeps the HMaster and
region servers together. The ZooKeeper service tracks all the region servers in the
HBase cluster, including the total number of region servers and which region servers
are holding which data nodes. It is also responsible for establishing communication
between region servers and clients, and it maintains configuration information. In
addition, it provides ephemeral nodes that represent the various region servers, which
helps in tracking server failures and network partitions.
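How the four region server components fit together can be sketched as a toy write and read path: a write is appended to the WAL, buffered in the MemStore, and flushed to a sorted HFile at a threshold, while reads fall back from the MemStore to the Block Cache and then to the HFiles. This is a minimal in-memory sketch, assuming a single column family and tiny thresholds; `ToyRegionServer` and its parameters are hypothetical and do not reflect HBase's real classes.

```python
from collections import OrderedDict


class ToyRegionServer:
    def __init__(self, flush_threshold=3, cache_capacity=2):
        self.wal = []                      # write-ahead log, appended to first
        self.memstore = {}                 # in-memory buffer of recent writes
        self.hfiles = []                   # immutable, sorted key-value files
        self.block_cache = OrderedDict()   # read cache with LRU eviction
        self.flush_threshold = flush_threshold
        self.cache_capacity = cache_capacity

    def put(self, key, value):
        self.wal.append((key, value))      # 1. log the edit for crash recovery
        self.block_cache.pop(key, None)    # drop any stale cached value
        self.memstore[key] = value         # 2. buffer the edit in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. persist the buffered edits as one sorted, immutable file
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, key):
        if key in self.memstore:           # newest data first
            return self.memstore[key]
        if key in self.block_cache:        # then the read cache
            self.block_cache.move_to_end(key)
            return self.block_cache[key]
        for hfile in reversed(self.hfiles):    # then files, newest first
            for k, v in hfile:
                if k == key:
                    self._cache(key, v)
                    return v
        return None

    def _cache(self, key, value):
        self.block_cache[key] = value
        if len(self.block_cache) > self.cache_capacity:
            self.block_cache.popitem(last=False)   # evict least recently used
```

A real store would scan files with an index rather than linearly and would truncate the WAL after a flush; both are omitted here to keep the sketch short.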
3.2 Cassandra Architecture:
Cassandra is a distributed peer-to-peer database running on a homogeneous cluster of
nodes. In the background, the nodes use the Gossip protocol to communicate with each
other and to detect any faulty nodes in the cluster. It was architected to handle large
volumes of data while providing high availability.
The key feature of the Cassandra architecture is that there is no single point of
failure. This means that if there are 100 nodes in a cluster and one node fails, the
cluster should continue to operate. Cassandra's components are listed below
Tutorialspoint.com (n.d.):
Figure 2: Cassandra Architecture
• 1. Node is the place where data is stored.
• 2. Data center is a collection of related nodes.
• 3. Cluster is the component that holds one or more data centers.
• 4. Commit log is a crash-recovery mechanism; every write operation is written to the
commit log.
• 5. Mem-table is a data structure that resides in memory. Data is written to the
mem-table after the commit log.
• 6. SSTable is a disk file to which the mem-table data is flushed when its contents
reach a threshold value.
• 7. Bloom filter is a fast, nondeterministic (probabilistic) algorithm for testing
whether an element is a member of a set. It is a special kind of cache that is accessed
on each query.
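The Bloom filter in the component list is worth a short illustration, since its behaviour is less obvious than that of the other components. The sketch below is a generic Bloom filter, not Cassandra's implementation; the bit-array size, the number of hashes, and the salted SHA-256 scheme are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)   # one byte per bit, for clarity

    def _positions(self, item: str):
        # Derive several positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("row-17")
print(bf.might_contain("row-17"))
```

A negative answer is always correct, so a read can skip a disk file that cannot contain the key; a positive answer may occasionally be a false positive and still requires checking the file.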
4 Comparison Between HBase and Cassandra on the Basis of Transaction Management and
Security:
SQL has ruled the market for ages, and its main guarantee is ACID, which stands for
Atomicity, Consistency, Isolation, and Durability.
• Transaction Management: Cassandra does not support the full set of ACID properties;
it gives you AID among them. That means Cassandra is atomic, isolated, and durable in
nature. The C property of ACID, consistency, does not apply to Cassandra, because it
has no concept of referential integrity or foreign keys. Cassandra lets you tune the
consistency level to your needs, so you can have either partial or full consistency.
Cassandra uses lightweight transactions, also known as compare-and-set transactions.
They can be used for both INSERT and UPDATE statements via the IF clause. HBase
provides strong consistency for single-row updates or batch updates within a single
region. However, without consistency guarantees for data updates across regions and
tables, it is highly challenging and error-prone to implement exactly-once semantics on
HBase. Apache Tephra helps to provide ACID transactions on HBase.
• Security: Like all NoSQL databases, HBase and Cassandra have security issues, the
main one being that securing data costs performance and makes the system heavy and
inflexible. Still, both systems have their own security features: Cassandra has
inter-node and client-to-node encryption, and HBase provides secure communication with
other technologies. In Cassandra we can define user roles and set conditions for those
roles so that users can access only limited data, whereas HBase assigns data sets a
visibility label and then tells users which labels they can see.
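Two ideas from the transaction management item above can be sketched in a few lines: tunable consistency and compare-and-set writes. The first function uses the standard quorum rule for Dynamo-style systems (a read observes the latest acknowledged write when the read and write replica sets must overlap); the second is a toy analogue of a conditional insert. Both are illustrative assumptions, not Cassandra driver calls, and the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    # A majority of the replicas.
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int,
                           write_acks: int, read_acks: int) -> bool:
    # If R + W > N, every read set shares at least one node with every write set.
    return read_acks + write_acks > replication_factor


def insert_if_not_exists(table: dict, key, row) -> bool:
    """Toy compare-and-set: the insert is applied only if the key is absent."""
    if key in table:
        return False
    table[key] = row
    return True


n = 3
print(read_sees_latest_write(n, quorum(n), quorum(n)))   # quorum writes and reads
print(read_sees_latest_write(n, 1, 1))                   # single-replica acks

users = {}
print(insert_if_not_exists(users, "alice", {"plan": "free"}))
print(insert_if_not_exists(users, "alice", {"plan": "paid"}))
```

Requiring fewer acknowledgements lowers latency at the cost of possibly stale reads, which is the trade-off behind choosing partial or full consistency.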
5 Learning from literature survey
Traditional database systems have remained the main data storage systems in many large
IT organizations for the past 30 to 40 years. Over longer periods of time,
organizations need to store more and more detailed information every year. Increased
regulation is significantly increasing storage volumes in areas such as health and
finance. Because of the critical nature of the information, this data is often stored
on expensive shared storage systems. Shared storage arrays provide features such as
striping and mirroring for performance and availability, which are core factors for
traditional databases. Managing the volume and cost of this data growth within these
traditional systems is usually painful for IT organizations. On top of that,
traditional databases have growing limitations, such as poor representation of
real-world entities, compulsory database normalization to reduce data replication, and
poor performance for long-duration transactions, which were the main reasons for the
increasing cost of keeping a database error-free (n.d.).
Today it is widely held that NoSQL outperforms traditional database systems. As data
generated through the internet grows rapidly, traditional database systems cannot
handle all data transactions, and the result is compromised performance; NoSQL was
invented to overcome this problem. Because NoSQL systems can handle huge databases such
as big data, we can manage all of this data easily with the help of the various NoSQL
database systems. Adopting cloud computing has also become a trend in the industry. The
rapid development of internet-based services and applications such as social media
networks, online shopping, and web search engines is generating multi-terabyte data
Wang et al. (2014). Given the variety of databases available on the market, we need to
concentrate on databases that provide almost the same features. According to the study
by Waage & Wiese (2014), Cassandra and HBase share a couple of common features. Both
can be considered key-value stores as well as column-family stores. Both provide high
availability by distributing and replicating data, and both are implemented in the Java
programming language. To benchmark the selected databases we need to select a proper
benchmarking tool; a number of benchmarking tools are available in the industry, such
as SandStorm, YCSB, and Altoros. According to the study by Abubakar et al. (2014), IT
professionals are working hard to ensure that the databases they choose are optimised
for application success. Such a selection can be based on benchmark tests of the
databases. The Yahoo! Cloud Serving Benchmark (YCSB) framework helps to compare the
performance of the new generation of NoSQL databases where resources are limited.
Another reason to choose the YCSB framework is that the YCSB client is a good workload
generator; we can also install multiple systems on the same hardware configuration and
run the same workloads against each system Cooper et al. (2010). The YCSB client is
written in the Java programming language; it generates the data to be loaded into the
database and the operations that make up the workload. In basic operation, multiple
client threads are driven by the workload executor. Each thread runs a series of
sequential operations by making calls to the database interface layer, for both the
load phase and the run phase. All threads also measure the latency and throughput of
their operations and report these measurements to the statistics module Abramova et al.
(2014).
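The client structure described above (several threads, each issuing sequential operations and timing them) can be sketched as follows. This is not YCSB code, which is written in Java; it is a minimal Python analogue in which `operation` stands in for a call to a database interface layer, and all names and default values are hypothetical.

```python
import threading
import time


def run_client_threads(operation, num_threads=4, ops_per_thread=100):
    latencies = []                 # seconds per operation, from all threads
    lock = threading.Lock()

    def worker():
        local = []
        for _ in range(ops_per_thread):
            start = time.perf_counter()
            operation()            # call into the (hypothetical) database layer
            local.append(time.perf_counter() - start)
        with lock:                 # merge this thread's measurements
            latencies.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    began = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - began

    return {
        "operations": len(latencies),
        "throughput_ops_per_sec": len(latencies) / elapsed,
        "avg_latency_us": 1e6 * sum(latencies) / len(latencies),
    }


# Stand-in operation: sleep briefly instead of contacting a database.
report = run_client_threads(lambda: time.sleep(0.001), num_threads=2, ops_per_thread=5)
print(report)
```

Throughput is computed from wall-clock time across all threads, while average latency is computed per operation, which is why the two can move independently in the result graphs.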
The tools and parameters used in the research papers above motivated me to use HBase
and Cassandra as the NoSQL databases and to benchmark them using YCSB; using different
workload types helps to establish a proper and accurate benchmarking report. To select
the workloads I referred to Abramova et al. (2014); as that study shows, the different
workload types take varying amounts of time, from 10 minutes to 10 hours. Workload A is
update-heavy: the operations performed are 50% reads and 50% updates. Workload B is
read-heavy: YCSB performs 95% read and 5% update operations. In Workload C, YCSB
performs 100% read operations. In Workload D, the ratio is 95% reads and 5% inserts,
with reads targeting the most recently inserted data. In Workload E, YCSB performs 95%
scans over short ranges of records and 5% inserts Abramova et al. (2014). Based on the
previous reports, I chose Workloads A and B, which cover heavy-write and heavy-read
tasks, to check the durability of HBase and Cassandra.
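The workload mixes listed above can be written down as a small table of operation proportions, from which a stream of operations is drawn at random. The sketch below is an illustration of the idea rather than YCSB's own workload files; the dictionary layout, the function name, and the fixed seed are assumptions.

```python
import random

# Operation proportions per workload, as described in the text above.
WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50},
    "B": {"read": 0.95, "update": 0.05},
    "C": {"read": 1.00},
    "D": {"read": 0.95, "insert": 0.05},
    "E": {"scan": 0.95, "insert": 0.05},
}


def generate_operations(workload: str, count: int, seed: int = 0):
    """Draw `count` operation names according to the workload's mix."""
    mix = WORKLOADS[workload]
    rng = random.Random(seed)      # fixed seed so runs are repeatable
    return rng.choices(list(mix), weights=list(mix.values()), k=count)


ops = generate_operations("B", 1000)
print(ops.count("read"), ops.count("update"))
```

Fixing the seed means both databases can be driven with the same sequence of operations, which keeps a side-by-side comparison fair.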
6 Performance Test Plan
6.1 Physical Machine
• Processor: 2.00 GHz AMD Ryzen 5 2500U
• Number of Core: 4
• Memory: 20GB
• Operating System: Windows 10 Professional
• System Type: 64-bit Operating System, x64-based processor
• Virtualization tool: Open Stack Cloud Server
6.2 HBase Virtual Machine:
• Operating System : Ubuntu (64-bit)
• Memory: 4 GB
• Virtual Processor: 2
6.3 Cassandra Virtual Machine:
• Operating System : Ubuntu (64-bit)
• Memory: 4 GB
• Virtual Processor: 2
6.4 Benchmarking Tool:
• Yahoo! Cloud Serving benchmark 0.15.0 (YCSB-0.15.0)
6.5 Workload Parameter:
• WorkLoad A
• WorkLoad B
6.6 Visualization tool:
• Tableau
7 Evaluations and Results
We ran Workloads A and B against the HBase and Cassandra stores with the benchmarking
tool (YCSB). Each workload test was performed twice, and the averaged outputs were
assessed.
• WorkLoad A: Read = 50% and Update = 50%
• WorkLoad B: Read = 95% and Update = 5%
7.1 Workload A Result:
• Average Latency Vs. Throughput
Figure 3: Average Latency Vs. Throughput
The graph above compares Cassandra and HBase in terms of load throughput versus load
average latency. For HBase, load throughput falls steadily up to count 3 while load
average latency rises. For Cassandra, load throughput initially rises together with
load average latency up to count 3; after count 3, load throughput falls while average
latency keeps rising. From the graph we can say that as the number of transactions per
second decreases, the time also decreases, but compared with Cassandra, HBase is
performing better.
• Read operation vs. update latency
Figure 4: Read operation vs. update latency
The figure above compares average latency with the number of read operations. For
Cassandra, the average latency at count 1 is 403.87, and it decreases slowly as the
number of read operations increases, reaching 360 at count 3. For HBase, latency
increases from count 2 to count 3. This shows that even as the number of read
operations increases, Cassandra gives better performance than HBase.
7.2 Workload B Result:
• Overall Insert latency Vs. Overall Throughput
Figure 5: Overall Insert latency Vs. Overall Throughput
The figure above shows the Workload B results for Cassandra and HBase. For Cassandra,
average latency decreases even as overall throughput per second increases. For HBase,
average latency increases initially but then decreases, and by count 3 it has dropped
below Cassandra's. So for Workload B, HBase initially performs poorly compared with
Cassandra, but from count 3 onward its performance improves.
• Read Operation Vs. Overall Read Throughput
Figure 6: Read Operation Vs. Overall Read Throughput
The figure above compares the number of read operations with overall read throughput.
At count 1, read throughput was more than 2600 and the number of read operations was
approximately 95000. As the read operation count increases, read throughput for HBase
decreases, whereas for Cassandra overall throughput increases as read operations
increase. Hence, we can say that HBase is performing better than Cassandra.
• Update Operation Vs. Update Average Latency
Figure 7: Update Operation Vs. Update Average Latency
In the figure above we can clearly see that, for HBase, update average latency
decreases as update operations increase up to count 2, but then increases drastically
between count 2 and count 3. For Cassandra, by contrast, average latency decreases
slowly as update operations increase: it was 422.5 at count 1 and 397.1 at count 3. So
in this test, average latency is increasing for HBase but decreasing for Cassandra.
8 Conclusion and Discussion:
To conclude, we conducted an experiment in which two very prominent NoSQL databases,
HBase and Cassandra, were compared on parameters such as consistency, scalability, and
performance. The architecture of both databases was examined along with their key
characteristics. The experiment was conducted with the Yahoo! Cloud Serving Benchmark
for Workload A and Workload B. All workloads were tested twice, and the paper reports
the averaged results. Comparing the results above, for Workload A, with 50 percent read
and 50 percent update operations, the two databases performed differently. For Workload
A with record counts of 100000, 20000 and 300000, HBase performed better than
Cassandra, but in read operation vs. update latency Cassandra provided better
performance.
For Workload B, with 95 percent read and 5 percent update operations, Cassandra gave
better performance than HBase, while for read operations HBase performed better than
Cassandra. It is very difficult to scale the performance of HBase with increasing
workload, whereas Cassandra gives us the expected results, with significantly lower
latency and higher throughput values. From the analysis we can say that Cassandra
performed far better than HBase for all the given parameters against the benchmark,
whereas HBase proved to be highly inconsistent throughout the experiment. Going
forward, we will run the tests with much heavier workloads to determine how these two
NoSQL databases perform and which one is more suitable for enterprise operations.
References
(n.d.).
Abramova, V., Bernardino, J. & Furtado, P. (2014), Evaluating cassandra scalability
with ycsb, in ‘International Conference on Database and Expert Systems Applications’,
Springer, pp. 199–207.
Abubakar, Y., Adeyi, T. S. & Auta, I. G. (2014), ‘Performance evaluation of nosql
systems using ycsb in a resource austere environment’, Performance Evaluation 7(8),
23–27.
Bakshi, K. (2012), Considerations for big data: Architecture and approach, in ‘2012
IEEE Aerospace Conference’, IEEE, pp. 1–7.
Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R. & Sears, R. (2010),
Benchmarking cloud serving systems with ycsb, in ‘Proceedings of the 1st ACM symposium
on Cloud computing’, ACM, pp. 143–154.
McCreary, D. & Kelly, A. (2014), ‘Making sense of nosql’, Shelter Island: Manning
pp. 19–20.
Overview of HBase Architecture and its Components (2016).
URL: https://www.dezyre.com/article/overview-of-hbase-architecture-and-its-components/295
Tutorialspoint.com (n.d.), ‘Cassandra architecture’.
URL: https://www.tutorialspoint.com/cassandra/cassandraarchitecture.htm
Waage, T. & Wiese, L. (2014), Benchmarking encrypted data storage in hbase and
cassandra with ycsb, in ‘International Symposium on Foundations and Practice of
Security’, Springer, pp. 311–325.
Wang, H., Li, J., Zhang, H. & Zhou, Y. (2014), Benchmarking replication and consistency
strategies in cloud serving databases: Hbase and cassandra, in ‘Workshop on Big Data
Benchmarks, Performance Optimization, and Emerging Hardware’, Springer, pp. 71–82.

Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 

Data Storage and Management project Report

Comparison of HBase and Cassandra, the Two NoSQL Databases
Tushar Shailesh Dalvi
x18134301 / B
22/04/2019

Abstract

Today we are immersed in digital information, yet we are poor at managing and processing it. It is becoming increasingly hard to store and analyse data efficiently and economically with traditional database management tools. The kinds of data arriving in databases are also changing, and researchers around the globe are occupied with the analysis of ultra-large databases. Relational databases such as MSSQL offer data warehousing capabilities that present data from the user's point of view, but they do not reach maximum performance, so retrieving output and performing other work takes time; non-relational databases can perform better in such cases. A number of non-relational and big-data systems are available on the market, such as MongoDB and Hive. This paper evaluates the random read and random write performance of HBase and Cassandra and compares the results obtained from several runs on Ubuntu.

Keywords: Big Data, HBase, Cassandra, YCSB, Workload, Ubuntu, NoSQL, Hadoop, RDBMS

1 Introduction

In the last few decades we have witnessed a revolution in technology. The traditional way of managing structured data involves a relational database and a schema to manage the storage and retrieval of the dataset. For managing large datasets in a structured fashion, the primary approaches are data warehouses and data marts. A data warehouse is a relational database system used for storing, analysing, and reporting. The data mart is the layer used to access the data warehouse. The data kept in the warehouse is sourced from the operational systems Bakshi (2012).

During the same period, we have been engulfed by the rise in the amount of electronic and digital data. Half a decade ago the average size of corporate data tended to be in the range of gigabytes (GB). Now multi-terabyte (TB) or even petabyte (PB) sizes are common for large corporate databases. A few years ago companies managed this amount of data with relational database management systems, but more recently the internet and social media have been generating massive data that traditional database systems cannot handle well because of high latency in query processing, low data transfer rates, and limited horizontal scalability. To overcome this, a new kind of database system was introduced: NoSQL. NoSQL stands for "Not only SQL" and is based on a node-to-node architecture. NoSQL databases do not require any kind of fixed schema. The main motivations for this approach are simpler database design, horizontal scaling across clusters of machines, and finer control over availability McCreary & Kelly (2014). The structure of NoSQL databases is entirely different from that of relational databases, which can make operations faster. There are four main types of NoSQL database:

• Graph stores: store data in graph form, such as social connections or network data; an example is Neo4J.
• Document stores: store documents made up of tagged elements; an example is CouchDB.
• Key-value stores: large hash tables that contain keys and values, such as Amazon S3.
• Column-based stores: optimized for queries over large datasets; they store columns of data together instead of rows.

Recently, NoSQL databases have gained many notable features and give users the option to pick a database according to their application requirements.

HBase is a specific implementation of NoSQL within the Hadoop project. HBase is a distributed, column-oriented database built on top of HDFS (Hadoop Distributed File System), which is discussed in a later section. HBase is not a relational database and does not support SQL. However, it can host very large, sparsely populated tables on clusters made of commodity hardware. Data is stored in rows, and columns are grouped into column families. The column families of a table must be specified as part of the table schema definition. In HBase, tables are partitioned horizontally into regions, which are the units that get distributed over the HBase cluster. The HBase infrastructure design is also based on a distributed master-slave architecture. The HBase master node orchestrates a cluster of one or more slaves. HBase also depends on a coordination service (ZooKeeper). HBase maintains all its data via the Hadoop file system APIs Bakshi (2012).

2 Key Characteristics

2.1 HBase

As a NoSQL database, HBase offers many useful functions. The following are some of its key features:

• HBase is designed for large databases. It can process petabytes of data in a distributed environment.
• HBase runs on top of HDFS, which needs a fairly large number of nodes (minimum 5).
• HBase provides quick access to data; when random, real-time access to data is needed, HBase is a good choice.
• HBase provides consistency for high-speed read and write tasks.
• To reduce I/O time and overhead, HBase offers automatic and manual splitting of regions into smaller subregions once a threshold is reached.
• HBase provides data replication across clusters.

2.2 Cassandra

Cassandra is a robust and complete NoSQL database that has been deployed by some of the biggest companies, such as Facebook and Google. The following are some key features of Cassandra:

• Cassandra helps distribute data efficiently over multiple data centers through a simple process of data replication.
• It writes large amounts of data without affecting read efficiency.
• Cassandra supports all kinds of structured, unstructured, and semi-structured datasets.
• Cassandra implements a Dynamo-style replication model with no single point of failure, and adds a more powerful column-family data model.
• Cassandra is linearly scalable: throughput increases as the number of nodes in the cluster increases, so it maintains a quick response time.
• It is an open-source database, like HBase and MongoDB.

3 Database Architectures

3.1 HBase Architecture

On top of HDFS, HBase provides low-latency random reads and writes. HBase tables can become too large for a single machine to handle, so they are dynamically distributed. To achieve this, HBase scales horizontally using regions. A region is a contiguous, sorted group of rows that are stored together. The HBase architecture has several slaves, i.e. region servers, and a single HBase master node (HMaster). When a client sends a write request, it is served by the region server that hosts the corresponding region, while HMaster coordinates the region servers. HBase can be run in a multiple-master setup, but only a single master can be active at a time Overview of HBase Architecture and its Components (2016).
Below are the main components of the HBase architecture:

HMaster: HMaster manages and monitors the HBase cluster. It is a lightweight process that decides which part of the data is stored in which region, for load balancing. It also applies metadata changes requested by clients. DDL operations are handled by HMaster: it administers creating, updating, and deleting tables.

Region Server: A region server runs on an HDFS data node and handles read, write, and delete operations. It runs on every node in the Hadoop cluster. A region server has the following components:

Figure 1: HBase Architecture

1. Block Cache is the cache used for reading data. Frequently read data is kept in this read cache, and when the block cache is full, the least recently used data is evicted.
2. MemStore is the write cache that holds new data not yet written to disk. There is one MemStore for every column family in a region.
3. WAL (Write Ahead Log) is a file that records new data that has not yet been persisted to permanent storage.
4. HFile is the actual storage file that stores rows as sorted key-value pairs on disk.

ZooKeeper: ZooKeeper is the coordination and service-discovery service that keeps the HMaster and region servers together. It tracks all the region servers in the HBase cluster, including the total number of region servers and which region servers are hosting which data nodes. It is also responsible for establishing communication between region servers and clients, and it maintains configuration information. It provides ephemeral nodes that represent the various region servers, which helps in tracking server failures and network partitions.

3.2 Cassandra Architecture

Cassandra is a distributed peer-to-peer database running on a cluster of homogeneous nodes. It uses the gossip protocol in the background so that nodes can communicate with each other and detect faulty nodes in the cluster. It was architected to handle large volumes of data while providing high availability. The key feature of the Cassandra architecture is that there is no single point of failure: if there are 100 nodes in a cluster and one node fails, the cluster should continue to operate. The Cassandra components are listed below Tutorialspoint.com (n.d.):

Figure 2: Cassandra Architecture

1. Node is the place where data is stored.
2. Data center is a collection of related nodes.
3. Cluster is a component that contains one or more data centers.
4. Commit log is a crash-recovery mechanism; every write operation is written to the commit log.
5. Mem-table is a data structure that resides in memory. Data is written to the mem-table after the commit log.
6. SSTable is a disk file to which the mem-table data is flushed when its contents reach a threshold value.
7. Bloom filter is a fast, nondeterministic algorithm for testing whether an element is a member of a set. It is a special kind of cache that is accessed after every query.
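The write path described in the component lists (commit log first, then mem-table, then a flush to an SSTable; HBase's WAL, MemStore, and HFile play analogous roles) can be sketched roughly as follows. This is a toy Python model, not Cassandra or HBase code: the class name ToyWritePath, the flush threshold of three entries, and the in-memory lists standing in for disk files are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWritePath:
    """Toy model of the commit log -> mem-table -> SSTable flow (illustrative only)."""
    flush_threshold: int = 3                        # hypothetical: entries held before a flush
    commit_log: list = field(default_factory=list)  # append-only record of every write
    memtable: dict = field(default_factory=dict)    # in-memory writes, newest value per key
    sstables: list = field(default_factory=list)    # immutable, key-sorted stand-ins for disk files

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log first, so the write survives a crash
        self.memtable[key] = value            # 2. then update the in-memory table
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # 3. flush once the threshold is reached

    def flush(self):
        # Persist the mem-table as a sorted, immutable table and start a fresh one.
        # (A real system would also recycle the flushed part of the commit log.)
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Check memory first, then flushed tables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for stored_key, stored_value in table:
                if stored_key == key:
                    return stored_value
        return None


store = ToyWritePath()
for sequence, user in enumerate(["u3", "u1", "u2", "u1"]):
    store.write(user, sequence)

print(store.read("u1"), store.read("u2"), len(store.sstables))
```

The third write fills the mem-table and triggers a flush, so the later write to "u1" lives only in memory and shadows the older flushed value when read. The linear scan over each flushed table is the step that a Bloom filter is meant to avoid for keys that are not present.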
4 Comparison Between HBase and Cassandra on the Basis of Transaction Management and Security

SQL has ruled the market for ages, and its main functionality is ACID, which stands for atomicity, consistency, isolation, and durability.

• Transaction management: Cassandra does not support the full set of ACID properties; it gives you AID among them. That means Cassandra is atomic, isolated, and durable in nature. The C property of ACID, consistency, does not apply to Cassandra because it has no concept of referential integrity or foreign keys. Cassandra lets you tune the consistency level to your needs, so you can have either partial or full consistency. Cassandra uses lightweight transactions, also known as compare-and-set transactions. They can be used with both INSERT and UPDATE statements through the IF clause. HBase provides strong consistency for single-row updates or batch updates within a single region. However, it is highly challenging and error-prone to implement exactly-once semantics on HBase without consistency guarantees for data updates across regions and tables. Apache Tephra helps provide ACID transactions on HBase.

• Security: Like all NoSQL databases, HBase and Cassandra have security issues; the main one is that securing data makes the system heavier and less flexible. Both systems nevertheless have their own security features: Cassandra has inter-node and client-to-node encryption, and HBase provides secure communication with other technologies. In Cassandra we can define user roles and set conditions on those roles so that users can access only limited data, whereas HBase assigns visibility labels to data sets and then tells users which labels they can see.

5 Learning from Literature Survey

Traditional database systems have remained the main data storage systems in many large IT organizations for the past 30 to 40 years. Over longer periods of time, organizations need to store more and more detailed information every year.
Increased regulation is significantly increasing storage volumes in areas such as health and finance. Because of the critical nature of the information, this data is often stored on expensive shared storage systems. Shared storage arrays provide features such as striping and mirroring, which are used for performance and availability, the core factors for traditional databases. Managing the volume and cost of this data growth within these traditional systems is usually painful for an IT organization. On top of that, the limitations of traditional databases, such as poor representation of real-world entities, the need for database normalization to reduce data replication, and poor performance for long-duration transactions, are the main reasons for the rising cost of keeping a database free of errors (n.d.).

In today's world, however, NoSQL is widely seen as outperforming traditional database systems. As data grows rapidly through the internet, it is not possible for a traditional database system to handle all data transactions, which leads to compromised performance; NoSQL was invented to overcome this problem. Because NoSQL systems can handle huge databases such as big data, we can manage such data easily with the help of various NoSQL database systems. Adopting cloud computing has also become a trend in the industry. The rapid development of internet-based services and applications such as social media networks, online shopping, and web search engines is generating multi-terabyte data Wang et al. (2014).

Because of the variety of databases available on the market, we need to concentrate on databases that provide almost the same features. According to the study performed by Waage & Wiese (2014), Cassandra and HBase share a couple of common features. Both can be considered key-value stores and column-family stores. By distributing and replicating data they provide high availability, and both are implemented in the Java programming language.

To perform benchmark testing on the selected databases we need to select a proper benchmarking tool; a number of benchmarking tools are available in the industry, such as SandStorm, YCSB, and Altoros. According to the study conducted by Abubakar et al. (2014), IT professionals work hard to ensure that the databases they choose are optimised for application success. Such a selection can be made on the basis of benchmark tests on the databases. The Yahoo! Cloud Serving Benchmark (YCSB) framework helps to compare the performance of the new generation of NoSQL databases where resources are limited. A further reason to choose the YCSB framework is that the YCSB Client is a capable workload generator, and we can install multiple systems on the same hardware configuration and run the same workloads against each system Cooper et al. (2010). The YCSB Client is written in Java and generates both the data to be loaded into the database and the operations that make up the workload. In its basic operation, multiple client threads are driven by the workload executor. Each thread runs a series of sequential operations by making calls to the database interface layer for the load and run phases. All threads also measure the latency and throughput of their operations and report these measurements to the statistics module Abramova et al. (2014).
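The client structure described above (a workload executor driving several threads, each timing its own sequential operations in a load phase and a run phase) can be illustrated with a small Python sketch. It is not YCSB code, which is written in Java: the DictStore class is a hypothetical in-memory stand-in for the database interface layer, and the thread and operation counts are arbitrary assumptions.

```python
import threading
import time


class DictStore:
    """Hypothetical stand-in for the database interface layer."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def insert(self, key, value):
        with self._lock:
            self._data[key] = value

    def read(self, key):
        with self._lock:
            return self._data.get(key)


def client_thread(store, keys, latencies):
    # Each thread runs its operations sequentially and times every call.
    for key in keys:
        start = time.perf_counter()
        store.read(key)
        latencies.append(time.perf_counter() - start)  # list.append is atomic in CPython


def run_workload(thread_count=4, ops_per_thread=1000):
    store = DictStore()
    key_sets = [
        [f"user{t}-{i}" for i in range(ops_per_thread)] for t in range(thread_count)
    ]

    # Load phase: insert every record before measuring anything.
    for keys in key_sets:
        for key in keys:
            store.insert(key, "x" * 100)

    # Run phase: the "executor" starts the threads and waits for them to finish.
    latencies = []
    threads = [
        threading.Thread(target=client_thread, args=(store, keys, latencies))
        for keys in key_sets
    ]
    started = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    elapsed = time.perf_counter() - started

    total_ops = len(latencies)
    return {
        "operations": total_ops,
        "throughput_ops_per_sec": total_ops / elapsed,
        "avg_latency_us": 1e6 * sum(latencies) / total_ops,
    }


stats = run_workload(thread_count=2, ops_per_thread=500)
print(stats)
```

The two reported figures correspond to the quantities plotted in the results section: throughput is total operations divided by wall-clock time, and average latency is the mean of the per-operation timings. The absolute numbers from this sketch say nothing about HBase or Cassandra, since the store is only a dictionary.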
The tools and parameters used in the research papers above motivated me to use HBase and Cassandra as the NoSQL databases and to benchmark them using YCSB with different workload types, which helps me establish a proper benchmarking report. To obtain a proper and accurate benchmark I used different workload types. To select the workloads I referred to Abramova et al. (2014); as that study shows, different types of workload take various amounts of time, from 10 minutes to 10 hours. Workload A is update heavy: the operations performed are 50% read and 50% update. Workload B is read heavy: YCSB performs 95% read and 5% update operations. In Workload C, YCSB only reads from the database, performing 100% read operations. In Workload D the read ratio is 95% and the insert ratio is 5%, and the task reads the most recently inserted data. In Workload E, YCSB performs 95% scans over short ranges of data and 5% inserts Abramova et al. (2014). Based on the previous reports, I chose Workloads A and B, covering a mix of heavy writes and reads as well as heavy reads, to check the durability of the HBase and Cassandra databases.

6 Performance Test Plan

6.1 Physical Machine
• Processor: 2.00 GHz AMD Ryzen 5 2500U
• Number of Cores: 4
• Memory: 20 GB
• Operating System: Windows 10 Professional
• System Type: 64-bit Operating System, x64-based processor
• Virtualization tool: OpenStack Cloud Server

6.2 HBase Virtual Machine
• Operating System: Ubuntu (64-bit)
• Memory: 4 GB
• Virtual Processors: 2

6.3 Cassandra Virtual Machine
• Operating System: Ubuntu (64-bit)
• Memory: 4 GB
• Virtual Processors: 2

6.4 Benchmarking Tool
• Yahoo! Cloud Serving Benchmark 0.15.0 (YCSB-0.15.0)

6.5 Workload Parameters
• Workload A
• Workload B

6.6 Visualization Tool
• Tableau

7 Evaluations and Results

We ran Workloads A and B against the HBase and Cassandra stores with the benchmarking tool (YCSB). Each workload test was performed twice and the average outputs were assessed.

• Workload A: Read = 50% and Update = 50%
• Workload B: Read = 95% and Update = 5%
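To make the two operation mixes concrete, the following Python sketch draws a sequence of operations with the read and update proportions stated for Workloads A and B. It only illustrates the idea of a proportional workload mix and is not how YCSB itself is implemented; the function name generate_operations, the fixed seed, and the operation count of 100,000 are assumptions chosen for the example.

```python
import random
from collections import Counter

# Operation proportions as stated in the text for Workloads A and B.
WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50},
    "B": {"read": 0.95, "update": 0.05},
}


def generate_operations(workload, operation_count, seed=0):
    """Return a list of operation names drawn according to the workload mix."""
    mix = WORKLOADS[workload]
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sequence
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=operation_count)


for workload in WORKLOADS:
    operations = generate_operations(workload, 100_000)
    counts = Counter(operations)
    shares = {op: counts[op] / len(operations) for op in WORKLOADS[workload]}
    print(workload, shares)
```

With a large operation count the observed shares settle close to the configured proportions, which is why Workload B produces roughly 95,000 reads for every 100,000 operations.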
7.1 Workload A Result

• Average Latency vs. Throughput

Figure 3: Average Latency vs. Throughput

The graph above compares Cassandra and HBase with respect to load throughput and load average latency. For HBase, load throughput falls consistently up to count 3 while load average latency increases. For Cassandra, by contrast, load throughput initially rises together with load average latency up to count 3, but after count 3 load throughput decreases while average latency keeps increasing. From the graph we can say that the number of transactions per second is decreasing and the time is also decreasing, but compared with Cassandra, HBase is performing better.
• Read operation vs. update latency

Figure 4: Read operation vs. update latency

The figure above compares average latency with the number of read operations. For Cassandra, the average latency at count 1 is 403.87, and it decreases slowly as read operations increase, reaching 360 at count 3. For HBase, latency increases from count 2 to count 3. This shows that even as the number of read operations increases, Cassandra gives better performance than HBase.

7.2 Workload B Result

• Overall Insert Latency vs. Overall Throughput

Figure 5: Overall Insert Latency vs. Overall Throughput

The figure above shows Workload B for Cassandra and HBase. For Cassandra, average latency decreases even though overall throughput per second increases. For HBase, average latency initially increased, but it then decreased and by count 3 had fallen below that of Cassandra. So for Workload B, HBase initially performs poorly compared with Cassandra, but by count 3 its performance has improved.

• Read Operation vs. Overall Read Throughput

Figure 6: Read Operation vs. Overall Read Throughput

The figure above compares read operations with overall read throughput. At count 1, read throughput was more than 2600 and the number of read operations was approximately 95000. As the read operation count increases, read throughput for HBase decreases, whereas for Cassandra overall throughput increases as read operations increase. Hence, we can say that HBase is performing better than Cassandra.
• Update Operation vs. Update Average Latency

Figure 7: Update Operation vs. Update Average Latency

In the figure above we can see that, for HBase, update average latency decreases as update operations increase up to count 2, but from count 2 to count 3 it increases drastically. For Cassandra, by contrast, average latency decreases slowly as update operations increase: at count 1 it was 422.5, and at count 3 it was 397.1. So in this test average latency is increasing for HBase but decreasing for Cassandra.
8 Conclusion and Discussion

In this study we conducted an experiment in which two very prominent NoSQL databases, HBase and Cassandra, were compared on parameters such as consistency, scalability, and performance. The architecture of both databases was examined along with their key characteristics. The experiment was conducted with the Yahoo! Cloud Serving Benchmark for Workload A and Workload B. All workloads were tested twice, and the paper reports the averaged results. Comparing the results, for Workload A, with 50 percent read and 50 percent update operations, the two databases performed differently. For Workload A with record counts of 100000, 20000 and 300000, HBase performed better than Cassandra, but in read operation vs. update latency Cassandra provided better performance. For Workload B, with 95 percent read and 5 percent update operations, Cassandra gave better performance than HBase, while for read operations HBase performed better than Cassandra. It is very hard to scale the performance of HBase with increasing workload, whereas Cassandra gives us the expected results with significantly lower latency and high throughput values. From the analysis we can say that Cassandra performed far better than HBase on most of the given parameters against the benchmark, whereas HBase proved to be highly inconsistent throughout the experiment. Going forward, we will run the tests with much higher workloads to work out how these two NoSQL databases perform and which one is more appropriate for enterprise operations.

References

(n.d.).

Abramova, V., Bernardino, J. & Furtado, P. (2014), Evaluating cassandra scalability with ycsb, in ‘International Conference on Database and Expert Systems Applications’, Springer, pp. 199–207.

Abubakar, Y., Adeyi, T. S. & Auta, I. G. (2014), ‘Performance evaluation of nosql systems using ycsb in a resource austere environment’, Performance Evaluation 7(8), 23–27.

Bakshi, K. (2012), Considerations for big data: Architecture and approach, in ‘2012 IEEE Aerospace Conference’, IEEE, pp. 1–7.

Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R. & Sears, R. (2010), Benchmarking cloud serving systems with ycsb, in ‘Proceedings of the 1st ACM symposium on Cloud computing’, ACM, pp. 143–154.

McCreary, D. & Kelly, A. (2014), ‘Making sense of nosql’, Shelter Island: Manning pp. 19–20.

Overview of HBase Architecture and its Components (2016).
URL: https://www.dezyre.com/article/overview-of-hbase-architecture-and-its-components/295

Tutorialspoint.com (n.d.), ‘Cassandra architecture’.
URL: https://www.tutorialspoint.com/cassandra/cassandraarchitecture.htm

Waage, T. & Wiese, L. (2014), Benchmarking encrypted data storage in hbase and cassandra with ycsb, in ‘International Symposium on Foundations and Practice of Security’, Springer, pp. 311–325.

Wang, H., Li, J., Zhang, H. & Zhou, Y. (2014), Benchmarking replication and consistency strategies in cloud serving databases: Hbase and cassandra, in ‘Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware’, Springer, pp. 71–82.