WHAT IS DBMS?
DBMS software primarily functions as an interface between the
end user and the database, simultaneously managing the data, the
database engine, and the database schema in order to facilitate the
organization and manipulation of data.
A database management system (or DBMS) is essentially nothing
more than a computerized data-keeping system.
Users of the system are given facilities to perform several kinds of
operations on such a system for either manipulation of the data in
the database or the management of the database structure itself.
Database Management Systems (DBMSs) are categorized
according to their data structures or types.
There are several types of database structures: inverted list, hierarchical, network, and relational.
WHAT IS RDBMS?
RDBMS stands for Relational Database Management System.
RDBMS is the basis for SQL and for all modern database
systems such as MS SQL Server, IBM DB2, Oracle, and MySQL.
The software used to store, manage, query, and retrieve data
stored in a relational database is called a relational database
management system (RDBMS).
The RDBMS provides an interface between users and
applications and the database, as well as administrative
functions for managing data storage, access, and performance.
The data in RDBMS is stored in database objects called tables.
A table is a collection of related data entries and it consists of
columns and rows.
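As a quick illustration, the snippet below creates such a table with Python's built-in `sqlite3` module; the table and column names are made up for the example.

```python
import sqlite3

# In-memory database; 'customers' and its columns are illustrative names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Alice", "Pune"))
conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Bob", "Delhi"))

# Each row is one related data entry; each column holds one attribute.
rows = conn.execute("SELECT id, name, city FROM customers").fetchall()
print(rows)  # [(1, 'Alice', 'Pune'), (2, 'Bob', 'Delhi')]
```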
LIMITATIONS OF RELATIONAL DATABASES
1. In a relational database, we need to define the structure and schema of
the data first, and only then can we process the data.
2. Relational database systems provide consistency and integrity of
data by enforcing the ACID properties (Atomicity, Consistency,
Isolation and Durability).
There are scenarios where this is essential, such as a banking system.
However, in many other cases these properties impose significant
performance overhead and can make database responses very slow.
3. Many applications store their data in JSON format, and an RDBMS
does not provide a good way of performing operations such as create,
insert, update and delete on this data. NoSQL databases, on the other
hand, store their data in JSON format, which is compatible with
most of today's applications.
JSON is the data structure of the Web.
It's a simple data format that allows programmers to store
and communicate sets of values, lists, and key-value
mappings across systems.
As JSON adoption has grown, database vendors have
sprung up offering JSON-centric document databases.
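A minimal sketch of that idea using Python's standard `json` module; the record shown is invented for illustration.

```python
import json

# A key-value mapping with a nested list, serialized for exchange between systems.
record = {"user": "alice", "tags": ["nosql", "json"], "active": True}
text = json.dumps(record)        # stored or transmitted as plain text
restored = json.loads(text)      # any JSON-aware system can read it back
print(restored["tags"])          # ['nosql', 'json']
```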
WHAT IS SQL?
SQL (Structured Query Language) is a standardized
programming language that is used to manage relational
databases and perform various operations on the data in them.
Initially created in the 1970s, SQL is regularly used not only
by database administrators, but also by developers writing
data integration scripts and data analysts looking to set up
and run analytical queries.
The uses of SQL include modifying database table and index
structures; adding, updating and deleting rows of data; and
retrieving subsets of information from within a database for
transaction processing and analytics applications.
Queries and other SQL operations take the form of commands
written as statements.
Commonly used SQL statements include SELECT, INSERT,
UPDATE, DELETE, CREATE, ALTER and TRUNCATE.
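A short sketch of those statement types in action, using Python's built-in `sqlite3` module against an in-memory database (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, qty INTEGER)")  # CREATE
conn.execute("INSERT INTO items (name, qty) VALUES ('pen', 10)")                     # INSERT
conn.execute("UPDATE items SET qty = 5 WHERE name = 'pen'")                          # UPDATE
conn.execute("ALTER TABLE items ADD COLUMN price REAL")                              # ALTER
selected = conn.execute("SELECT name, qty FROM items").fetchall()                    # SELECT
print(selected)  # [('pen', 5)]
conn.execute("DELETE FROM items WHERE name = 'pen'")                                 # DELETE
```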
SQL is a domain-specific language used in programming and
designed for managing data held in a relational database
management system, or for stream processing in a relational
data stream management system.
A database Management System provides the
mechanism to store and retrieve the data.
There are different kinds of database systems:
1. RDBMS (Relational Database Management System)
2. OLAP (Online Analytical Processing)
3. NoSQL (Not only SQL)
BRIEF HISTORY OF NOSQL DATABASES
•1998- Carlo Strozzi used the term NoSQL for his lightweight,
open-source relational database
•2000- Graph database Neo4j is launched
•2004- Google BigTable is launched
•2005- CouchDB is launched
•2007- The research paper on Amazon Dynamo is released
•2008- Facebook open-sources the Cassandra project
•2009- The term NoSQL was reintroduced
WHAT IS NOSQL?
NoSQL, known as the "Not only SQL" database, provides a mechanism for
the storage and retrieval of data and is a next-generation database.
It has a distributed architecture, and well-known examples such as
MongoDB are open source.
Most NoSQL databases are open source and offer horizontal scalability,
which means that commodity machines can be added to increase the
capacity of your clusters.
They are schema-free: there is no requirement to design tables before
pushing data into them.
NoSQL provides easy replication with very little manual intervention.
Once replication is configured, the system automatically takes care of
failovers.
The crucial factor about NoSQL is that it can handle huge
amounts of data, can achieve performance by adding more
machines to your clusters, and can be implemented on
commodity hardware.
There are close to 150 NoSQL databases in the market, which
makes it difficult to choose the right one for your system.
One of the advantages of NoSQL databases is that they are
really easy to scale, and they are much faster for most types of
operations that we perform on a database.
There are certain situations where you would prefer a relational
database over NoSQL; however, when you are dealing with a
huge amount of data, a NoSQL database is your best choice.
WHY NOSQL?
In today's time, data is becoming easier to access and capture
through third parties such as Facebook, Google+ and others.
Personal user information, social graphs, geolocation data,
user-generated content and machine logging data are just a few
examples where data has been increasing rapidly.
To make use of the above services properly, it is required to
process huge amounts of data, which SQL databases were never
designed to handle.
NoSQL databases evolved to handle this huge data.
•More than rows in tables — NoSQL systems store and retrieve data in many formats:
key-value stores, graph databases, column-family (Bigtable) stores, document stores and
even rows in tables.
•Free of joins — NoSQL systems allow you to extract your data using simple interfaces
•Schema free — NoSQL systems allow you to drag-and-drop your data into a folder and then
query it without creating an entity-relational model.
•Compatible with many processors — NoSQL systems allow you to store your database on
multiple processors and maintain high-speed performance.
•Usable on shared-nothing commodity computers — Most (but not all) NoSQL systems
leverage low cost commodity processors that have separate RAM and disk.
•Supportive of linear scalability — NoSQL supports linear scalability; when you add more
processors you get a consistent increase in performance.
•Innovative — NoSQL offers options beyond a single way of storing, retrieving and
manipulating data. NoSQL supporters (also known as NoSQLers) have an inclusive attitude
about NoSQL and recognize SQL solutions as viable options. To the NoSQL community,
NoSQL means "not only SQL."
NoSQL Is Not:
•About the SQL language — NoSQL is not defined as an application that uses a
language other than SQL. SQL, as well as other query languages, is used with NoSQL systems.
•Not only open source — Although many NoSQL systems have an open source model,
commercial products use NoSQL concepts as well as open source initiatives. You can still
have an innovative approach to problem solving with a commercial product.
•Not only Big Data — Many, but not all NoSQL applications, are driven by the inability of a
current application to efficiently scale when Big Data is an issue. While volume and velocity
are important, NoSQL also focuses on variability and agility.
•About cloud computing — Many NoSQL systems reside in the cloud to take advantage of
its ability to rapidly scale when the situations dictate. NoSQL systems can run in the cloud as
well as in your corporate data center.
•About a clever use of RAM and SSD — Many NoSQL systems focus on the efficient use of
RAM or solid-state disks to increase performance. While important, NoSQL systems can run
on standard hardware.
•An elite group of products — NoSQL is not an exclusive club with a few products. There
are no membership dues or tests required to join.
FEATURES OF NOSQL:
NoSQL has the following features:
1) Non-relational
2) Schema free
3) Simple API
4) Distributed

1) Non-relational
NoSQL databases never follow the relational model.
Never provide tables with flat fixed-column records.
Work with self-contained aggregates or BLOBs.
(A binary large object is a collection of binary data stored as a single
entity. Blobs are typically images, audio or other multimedia objects,
though sometimes binary executable code is stored as a blob)
Doesn’t require object-relational mapping and data normalization.
No complex features like query languages, query planners,
referential integrity joins, ACID.
2) Schema free
NoSQL databases are either schema-free or have relaxed schemas.
They do not require any definition of the schema of the data.
Offers heterogeneous structures of data in the same domain.
3) Simple API
Offers easy to use interfaces for storage and querying
APIs allow low-level data manipulation & selection
Text-based protocols, mostly used with HTTP REST and JSON
Mostly no standards-based NoSQL query language is used
Web-enabled databases running as internet-facing services

4) Distributed
•Multiple NoSQL databases can be executed in a distributed fashion
•Offers auto-scaling and fail-over capabilities
•Often ACID concept can be sacrificed for scalability and throughput
•Mostly no synchronous replication between distributed nodes
Master Replication, peer-to-peer, HDFS Replication
•Only providing eventual consistency
•Shared-nothing architecture. This enables less coordination and higher distribution.
WHEN TO GO FOR NOSQL
When you would want to choose NoSQL over relational database:
1.When you want to store and retrieve huge amount of data.
2.The relationship between the data you store is not that important
3.The data is not structured and changing over time
4.Constraints and Joins support is not required at database level
5.The data is growing continuously and you need to scale the database
regularly to handle it.
NoSQL is a database technology driven by Cloud Computing, the
Web, Big Data and the Big Users.
NoSQL now leads the way for popular internet companies
such as LinkedIn, Google, Amazon, and Facebook to overcome
the drawbacks of the 40-year-old RDBMS.
NoSQL Database, also known as “Not Only SQL” is an alternative
to SQL database which does not require any kind of fixed table
schemas unlike the SQL.
NoSQL generally scales horizontally and avoids major join
operations on the data. A NoSQL database can be referred to as
structured storage, of which the relational database is a subset.
NoSQL databases cover a swarm of multitude databases, each
having a different kind of data storage model.
The most popular types are Graph, Key-Value pairs, Columnar
and Document.
WHAT IS MEAN BY BUSINESS DRIVERS?
Business drivers are the key inputs and activities that drive
the operational and financial results of a business.
Common examples of business drivers are salespeople, number
of stores, website traffic, number and price of products sold, units
of production, etc.
THERE ARE 4 MAJOR BUSINESS DRIVERS FOR NOSQL:
Without a doubt, the key factor pushing organizations to look at alternatives to
their current RDBMSs is the need to query Big Data using clusters of commodity
processors.
Until around 2005, performance concerns were resolved by purchasing faster
processors.
In time, however, the ability to increase processing speed was no longer an option.
As chip density increased, heat could no longer dissipate fast enough without chips
overheating. This phenomenon, known as the power wall, forced system
designers to shift their focus from increasing speed on a single chip to using more
processors working together.
The need to scale out (also known as horizontal scaling), rather than scale up
(faster processors), moved organizations from serial to parallel processing where
data problems are split into separate paths and sent to separate processors to
divide and conquer the work.
While Big Data problems are a consideration for many
organizations moving away from RDBMS systems, the ability of a
single processor system to rapidly read and write data is also key.
Many single processor RDBMS systems are unable to keep up
with the demands of real-time inserts and online queries to the
database made by public-facing websites.
RDBMS systems frequently index many columns of every new
row, a process that decreases system performance.
When single-processor RDBMSs are used as the back end to a web
storefront, random bursts in web traffic slow down response
for everyone, and tuning these systems can be costly when both
high read and write throughput is desired.
Companies that want to capture and report on exception
data struggle when attempting to use rigid database schema
structures imposed by RDBMS systems.
For example, if a business unit wants to capture a few
custom fields for a particular customer, all customer rows
within the database need to store this information even
though it doesn't apply.
Adding new columns to an RDBMS requires the system to
be shut down and ALTER TABLE commands to be run.
When a database is large, this process can impact system
availability, losing time and money in the process.
The most complex part of building applications using RDBMSs is
the process of putting data into and getting data out of the database.
If your data has nested and repeated subgroups of data
structures, you need to include an object-relational mapping layer.
The responsibility of this layer is to generate the correct
combination of INSERT, UPDATE, DELETE and SELECT SQL
statements to move object data to and from the RDBMS.
This process is not simple and represents the largest
barrier to rapid change when developing new or modifying
existing applications.
CAP THEOREM (BREWER'S THEOREM)
You must understand the CAP theorem when you talk about
NoSQL databases, or in fact when designing any distributed system.
The CAP theorem states that there are three basic requirements
which exist in a special relation when designing applications for a
distributed architecture:
Consistency -
This means that the data in the database remains consistent after
the execution of an operation.
For example, after an update operation all clients see the same data.
Availability -
This means that the system is always on (service guarantee
availability), no downtime.
Partition Tolerance -
This means that the system continues to function even the
communication among the servers is unreliable, i.e. the servers
may be partitioned into multiple groups that cannot communicate
with one another.
Theoretically it is impossible to fulfill all 3 requirements.
CAP provides the basic requirement that a distributed system can
follow only 2 of the 3 requirements.
Therefore all current NoSQL databases follow different
combinations of C, A and P from the CAP theorem.
Here is the brief description of three combinations CA, CP, AP :
CA - Single site cluster, therefore all nodes are always in contact.
When a partition occurs, the system blocks.
CP - Some data may not be accessible, but the rest is still consistent
and accurate.
AP - The system is still available under partitioning, but some of the data
returned may be inaccurate.
•Schema flexibility, semi-structured data
•No complicated relationships
•Limited query capabilities (so far)
•Eventual consistency is not intuitive to program for
ADVANTAGES OF NOSQL:
•Can be used as Primary or Analytic Data Source
•Big Data Capability
•No Single Point of Failure
•No Need for Separate Caching Layer
•It provides fast performance and horizontal scalability.
•Can handle structured, semi-structured, and unstructured data with equal effect
•Object-oriented programming which is easy to use and flexible
•NoSQL databases don’t need a dedicated high-performance server
•Support Key Developer Languages and Platforms
•Simple to implement than using RDBMS
•It can serve as the primary data source for online applications.
•Handles big data which manages data velocity, variety, volume, and complexity
•Excels at distributed database and multi-data center operations
•Eliminates the need for a specific caching layer to store data
•Offers a flexible schema design which can easily be altered without downtime or service disruption
DISADVANTAGES OF NOSQL
•No standardization rules
•Limited query capabilities
•RDBMS databases and tools are comparatively mature
•It does not offer traditional database capabilities,
like consistency when multiple transactions are performed
simultaneously
•When the volume of data increases, it becomes difficult to
maintain unique keys across the dataset
•Doesn’t work as well with relational data
•The learning curve is stiff for new developers
•Being mostly open-source options, they are not as popular with enterprises.
NoSQL databases are mainly categorized into four
types: Key-value pair, Column-oriented, Graph-based
and Document-oriented.
Every category has its unique attributes and limitations.
None of the above-specified databases is better at solving
all problems.
Users should select the database based on their product needs.
Types of NoSQL Databases:
•Key-value Pair Based
•Column-oriented
•Graph-based
•Document-oriented
1) KEY VALUE PAIR BASED
Data is stored in key/value pairs. It is
designed in such a way to handle lots of data
and heavy load.
Key-value pair storage databases store data as
a hash table where each key is unique, and the
value can be a JSON, BLOB(Binary Large
Objects), string, etc.
For example, a key-value pair may contain a
key like "Website" associated with a value like the site's URL.
It is one of the most basic types of NoSQL database.
Key-value stores help the developer to store schema-less data.
They work best for shopping cart contents.
Redis, Dynamo and Riak are some NoSQL examples of key-value
databases. They are all based on Amazon's Dynamo paper.
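The hash-table idea behind a key-value store can be sketched in a few lines of Python; this is a toy in-memory model, not any particular product's API:

```python
# A minimal in-memory sketch of a key-value store: a hash table mapping
# unique keys to opaque values (strings, JSON-like documents, or raw bytes).
store = {}

def put(key, value):
    store[key] = value          # overwrite semantics: each key is unique

def get(key, default=None):
    return store.get(key, default)

put("Website", "example.org")                              # value is a plain string
put("cart:42", {"items": ["pen", "book"], "total": 150})   # value is a JSON-like document
print(get("cart:42")["items"])  # ['pen', 'book']
```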
2) COLUMN-ORIENTED
Column-oriented databases work on columns and are based on
the BigTable paper by Google.
Every column is treated separately.
Values of a single column are stored contiguously.
They deliver high performance on aggregation queries like SUM,
COUNT, AVG, MIN etc., as the data is readily available in a column.
Column-based databases are widely used to manage data
warehouses, business intelligence, CRM and library card catalogs.
HBase, Cassandra and Hypertable are NoSQL examples of
column-based databases.
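A small sketch of why columnar layout helps aggregation; the data is invented, and the two layouts are modeled with plain Python structures:

```python
# The same data in a row layout and a column layout. Aggregates such as
# SUM or AVG only need to scan one contiguous column in a columnar store.
rows = [("alice", 30), ("bob", 25), ("carol", 35)]       # row-oriented
columns = {"name": ["alice", "bob", "carol"],            # column-oriented
           "age":  [30, 25, 35]}

# Row store: touch every row and pick out one field each time.
row_sum = sum(r[1] for r in rows)
# Column store: the 'age' values are already stored together.
col_sum = sum(columns["age"])
print(row_sum, col_sum)  # 90 90
```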
AGGREGATE FUNCTIONS IN SQL
1. SQL provides a number of built-in functions to perform
operations on data; these functions are very useful for
performing mathematical calculations on table data.
2. Aggregate functions return a single value after
performing calculations on a set of values; here we will
discuss the five frequently used aggregate functions
provided by SQL.
3. These aggregate functions are used with the SELECT
statement; at a time, only one column can be applied to
a function.
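A sketch of the common aggregate functions, run here with Python's built-in `sqlite3` module on made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount REAL)")
conn.executemany("INSERT INTO sales (amount) VALUES (?)", [(10,), (20,), (30,)])

# The five common aggregate functions, each collapsing a set of values
# into a single result for the SELECT statement.
row = conn.execute(
    "SELECT COUNT(amount), SUM(amount), AVG(amount), MIN(amount), MAX(amount) FROM sales"
).fetchone()
print(row)  # (3, 60.0, 20.0, 10.0, 30.0)
```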
BIGTABLE PAPER BY GOOGLE:
Bigtable is a distributed storage system for managing structured data that is designed to
scale to a very large size: petabytes of data across thousands of commodity servers.
Many projects at Google store data in Bigtable, including web indexing, Google Earth,
and Google Finance.
These applications place very different demands on Bigtable, both in terms of data size
(from URLs to web pages to satellite imagery) and latency requirements (from backend
bulk processing to real-time data serving).
Despite these varied demands, Bigtable has successfully provided a flexible, high-
performance solution for all of these Google products.
In this paper we describe the simple data model provided by Bigtable, which gives
clients dynamic control over data layout and format, and we describe the design and
implementation of Bigtable.
Document-Oriented NoSQL DB stores and retrieves data as a
key-value pair but the value part is stored as a document.
The document is stored in JSON or XML formats.
In the following diagram you can see we have rows and
columns, and on the right we have a document database
which has a similar structure to JSON.
For the relational database, you have to know what
columns you have, and so on.
However, for a document database, you store data like a
JSON object, and you do not have to define it beforehand,
which makes it flexible.
The value is understood by the DB and can be queried.
The document type is mostly used for CMS systems, blogging platforms,
real-time analytics & e-commerce applications.
It should not be used for complex transactions which require multiple
operations, or for queries against varying aggregate structures.
Amazon SimpleDB, CouchDB, MongoDB, Riak and Lotus Notes are popular
document-oriented DBMS systems.
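A toy sketch of the document model: each record is a self-describing JSON document, and documents in one collection may carry different fields. The collection and field names here are invented.

```python
import json

# Documents are stored as JSON text; the store parses them to answer
# queries on fields inside the documents.
collection = [
    json.dumps({"_id": 1, "title": "Intro to NoSQL", "tags": ["nosql"]}),
    json.dumps({"_id": 2, "title": "CAP theorem", "tags": ["nosql", "cap"], "draft": True}),
]

# Query a field inside the documents; no fixed schema is required, and the
# second document carries a field ("draft") that the first one lacks.
docs = [json.loads(text) for text in collection]
hits = [d["title"] for d in docs if "cap" in d.get("tags", [])]
print(hits)  # ['CAP theorem']
```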
A content management system, often abbreviated as CMS, is software that helps users create, manage,
and modify content on a website without the need for specialized technical knowledge.
In simpler language, a content management system is a tool that helps you build a website without needing
to write all the code from scratch (or even know how to code at all).
Instead of building your own system for creating web pages, storing images, and other functions, the
content management system handles all that basic infrastructure stuff for you so that you can focus on more
forward-facing parts of your website.
A graph type database stores entities as well the
relations amongst those entities.
The entity is stored as a node with the
relationship as edges.
An edge gives a relationship between nodes.
Every node and edge has a unique identifier.
Compared to a relational database, where tables are loosely
connected, a graph database is multi-relational in nature.
Traversing relationships is fast, as they are already captured
in the DB and there is no need to calculate them.
Graph-based databases are mostly used for social networks,
logistics and spatial data.
Neo4J, Infinite Graph, OrientDB and FlockDB are some popular
graph-based databases.
A graph database is a database that is based on graph
theory. It consists of a set of objects, which can be a node
or an edge.
•Nodes represent entities or instances such as people,
businesses, accounts, or any other item to be tracked. They
are roughly the equivalent of a record, relation, or row in a
relational database, or a document in a document-store database.
•Edges, also termed graphs or relationships, are the lines
that connect nodes to other nodes; representing the
relationship between them. Meaningful patterns emerge
when examining the connections and interconnections of
nodes, properties and edges. The edges can either be
directed or undirected.
•In an undirected graph, an edge connecting two nodes
has a single meaning.
•In a directed graph, the edges connecting two different
nodes have different meanings, depending on their direction.
•Edges are the key concept in graph databases,
representing an abstraction that is not directly
implemented in a relational model.
•Properties are information associated to nodes.
•For example, if Wikipedia were one of the nodes, it might
be tied to properties such as website, reference material,
or words that start with the letter w, depending on which
aspects of Wikipedia are germane to a given database.
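The node/edge/property model above can be sketched with plain Python structures; the entities and relationship labels are made up for illustration:

```python
# A minimal directed graph: nodes with properties, edges as labeled relations.
nodes = {
    "alice": {"type": "person"},
    "bob":   {"type": "person"},
    "acme":  {"type": "company"},
}
edges = [
    ("alice", "KNOWS", "bob"),       # directed: the meaning depends on direction
    ("alice", "WORKS_AT", "acme"),
    ("bob",   "WORKS_AT", "acme"),
]

def neighbors(node, relation):
    """Traverse edges already stored in the graph: no join computation needed."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

print(neighbors("alice", "KNOWS"))    # ['bob']
print(neighbors("bob", "WORKS_AT"))   # ['acme']
```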
Using a NoSQL solution to solve your big data problems gives you
some unique ways to handle and manage your big data.
By moving data to queries, using hash rings to distribute the load,
using replication to scale your reads, and allowing the database to
distribute queries evenly to your data nodes, you can manage your data
and keep your systems running fast.
Now let's look at how NoSQL systems, with their inherently
horizontal scale-out architectures, are ideal for tackling big data
problems.
We'll look at several strategies that NoSQL systems use to scale
horizontally on commodity hardware.
We'll see how NoSQL systems move queries to the data, not data to
the queries.
We'll see how they use hash rings to evenly distribute the data on
a cluster and use replication to scale reads.
All these strategies allow NoSQL systems to distribute the workload
evenly and eliminate performance bottlenecks.
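One of those strategies, the hash ring, can be sketched as consistent hashing: each key hashes to a point on a ring and is owned by the first node clockwise from that point. This is a simplified model (no virtual nodes), and the node names are invented.

```python
import hashlib
from bisect import bisect

# Invented node names for a three-node cluster.
NODES = ["node-a", "node-b", "node-c"]

def ring_position(label):
    # Hash any label to a stable position on the ring.
    return int(hashlib.md5(label.encode()).hexdigest(), 16)

# Place each node on the ring, sorted by position.
ring = sorted((ring_position(n), n) for n in NODES)

def node_for(key):
    # Walk clockwise from the key's position to the first node (wrapping around).
    pos = ring_position(key)
    idx = bisect([p for p, _ in ring], pos) % len(ring)
    return ring[idx][1]

# Keys spread across the cluster; the same key always routes to the same node.
assignments = {k: node_for(k) for k in ["user:1", "user:2", "order:99"]}
print(assignments)
```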
So what exactly is a big data problem?
A big data class problem is any business problem that's
so large that it can't be easily managed using a single processor.
Big data problems force you to move away from a single-
processor environment toward the more complex world of
distributed computing.
Though great for solving big data problems, distributed
computing environments come with their own set of challenges.
One of the core concepts in big data is linear scaling.
When a system has linear scaling, you automatically get a proportional performance
gain each time you add a new processor to your cluster, as shown in fig.
Scaling independent transformations —
Many big data problems are driven by discrete transformations on individual items
without interaction among the items.
These types of problems tend to be the easiest to solve: simply add a new node to
the cluster.
Image transformation is a good example of this.
Scaling availability —
Duplicate the writes onto multiple servers in data centers in distinct geographic
regions.
If one data center experiences an outage, the other data centers can supply the data.
Scaling availability keeps replica copies in sync and automates the switchover if one
data center fails.
2. UNDERSTANDING LINEAR SCALABILITY AND EXPRESSIVITY
As we mentioned earlier, linear scalability is the ability to get a
consistent amount of performance improvement as you add
additional processors to your cluster.
Expressivity is the ability to perform fine-grained queries on
individual elements of your dataset.
Understanding how well each NoSQL technology performs in terms
of scalability and expressivity is necessary when you're selecting a
system.
To select the right system, you'll need to identify the scalability and
expressivity requirements of your system and then make sure the
system that you select meets both of these criteria.
Scalability and expressivity can be difficult to quantify, and vendor
claims may not match actual performance for a particular business
problem.
3. UNDERSTANDING THE TYPES OF BIG DATA PROBLEMS
There are many types of big data problems, each requiring a different
combination of NoSQL systems.
After you've categorized your data and determined its type, you'll find there are
common patterns.
How you build your own big data classification system might be different from
this example, but the process of differentiating data types should be similar.
Read-mostly data is the most common classification.
It includes data that’s created once and rarely altered.
This type of data is typically found in data warehouse applications but is also
identified as a set of non-RDBMS items like images or video, event-logging data,
published documents, or graph data.
Event data includes things like retail sales events, hits on a website, system
logging data, or real-time sensor data.
Full-text documents —
This category of data includes any document that contains natural-
language text like the English language.
An important aspect of document stores is that you can query the
entire contents of your office document in the same way you would
query rows in your SQL system.
This means that you can create new reports that combine traditional
data in RDBMSs as well as the data within your office documents.
For example, you could create a single query that extracted all the
authors of titles of PowerPoint slides that contained the
keywords NoSQL or big data.
3. ANALYSING BIG DATA WITH A SHARED-NOTHING ARCHITECTURE
Three ways to share resources.
The left panel shows a shared RAM architecture, where many
CPUs access a single shared RAM over a high-speed bus.
This system is ideal for large graph traversal.
The middle panel shows a shared-disk system, where processors
have independent RAM but share disk using a storage area
network (SAN).
The right panel shows the architecture used in big data
solutions: cache-friendly, using low-cost commodity hardware,
and shared-nothing.
There are three ways that resources can be shared between
computer systems: shared RAM, shared disk, and shared-nothing.
The figure shows a comparison of these three distributed computing
architectures.
Of the architectural data patterns we've discussed
so far (row store, key-value store, graph store,
document store, and Bigtable store), only two (key-
value store and document store) lend themselves to
shared-nothing architectures without modification.
Bigtable stores scale well on shared-nothing
architectures because their row-column identifiers
are similar to key-value store keys.
But row stores and graph stores aren't cache-
friendly, since they don't allow a large BLOB to be
referenced by a short key that can be stored in the
cache.
should be in main memory.
This is why graph stores work most efficiently
when you have enough RAM to hold the graph.
If you can't keep your graph in RAM, graph stores
will try to swap the data to disk, which will
dramatically decrease graph query performance.
The only way to combat the problem is to move
to a shared-memory architecture, where multiple
threads all access a large RAM structure without
the graph data moving outside of the shared RAM.
4. CHOOSING DISTRIBUTION MODELS: MASTER-SLAVE VERSUS PEER-TO-PEER
From a distribution perspective, there are two main models: master-
slave and peer-to-peer. Distribution models determine the
responsibility for processing data when a request is made.
Master-slave versus peer-to-peer—the panel on the left
illustrates a master-slave configuration where all
incoming database requests (reads or writes) are sent to
a single master node and redistributed from there.
The master node is called the NameNode in Hadoop.
This node keeps a database of all the other nodes in the
cluster and the rules for distributing requests to each node.
The panel on the right shows how the peer-to-peer
model stores all the information about the cluster on each
node in the cluster.
If any node crashes, the other nodes can take over and
processing can continue.
With a master-slave distribution model, the role
of managing the cluster is done on a single master node.
This node can run on specialized hardware, such
as RAID drives, to lower the probability that it crashes.
The cluster can also be configured with a standby
master that's continually updated from the master node.
The challenge with this option is that it's difficult to
test the standby master without jeopardizing the
health of the cluster.
Failure of the standby master to take over from the
master node is a real concern for high-availability
operations.
The initial versions of Hadoop (frequently
referred to as the 1.x versions) were designed to
use a master-slave architecture with the
NameNode of a cluster being responsible for
managing the status of the cluster.
NameNodes usually don’t deal with any
MapReduce data themselves.
Their job is to manage and distribute queries to
the correct nodes on the cluster.
Hadoop 2.x versions are designed to remove
single points of failure from a Hadoop cluster.
One of the strengths of a Hadoop system is that it’s designed to
work directly with a filesystem that supports big data problems.
As you’ll see, Hadoop makes big data processing easier by using a
filesystem structure that’s different from a traditional system.
The Hadoop Distributed File System (HDFS) provides many of the
supporting features that MapReduce transforms need to be efficient
Unlike an ordinary filesystem, it’s customized for transparent,
reliable, write-once, read-many operations.
You can think of HDFS as a fault-tolerant, distributed, key-value
store tuned to work with large files.
HDFS is different: it uses a large (64 megabytes by default)
block size to handle data.
The figure shows how large HDFS blocks are compared to a typical
desktop filesystem block.
5. MAPREDUCE AND DISTRIBUTED FILESYSTEMS
The size difference between a filesystem
block size on a typical desktop or UNIX
operating system (4 KB) and the logical
block size within the Apache Hadoop
Distributed File System (64 MB), which is
optimized for big data transforms.
The default block size defines a unit of
work for the filesystem.
The fewer blocks used in a transfer, the
more efficient the transfer process.
The downside of using large blocks is that
if data doesn't fill an entire physical block,
the empty section of the block can't be used.
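A quick back-of-the-envelope comparison of the two block sizes for a 1 GiB file:

```python
# How block size affects the number of transfer units for a 1 GiB file.
FILE = 1 * 1024**3          # 1 GiB
DESKTOP_BLOCK = 4 * 1024    # 4 KB typical local filesystem block
HDFS_BLOCK = 64 * 1024**2   # 64 MB default HDFS logical block

desktop_blocks = FILE // DESKTOP_BLOCK
hdfs_blocks = FILE // HDFS_BLOCK
print(desktop_blocks)  # 262144 blocks on a desktop filesystem
print(hdfs_blocks)     # 16 blocks on HDFS: far fewer units of work to schedule
```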
6. HOW MAPREDUCE ALLOWS EFFICIENT
TRANSFORMATION OF BIG DATA PROBLEMS:
MapReduce is a core component in many big data solutions.
Figure provides a detailed look at the internal components of
a MapReduce job.
The basics of how the map and reduce functions work
together to gain linear scalability over big data transforms.
The map operation takes input data and creates a uniform
set of key-value pairs.
In the shuffle phase, which is done automatically by the
MapReduce framework, key-value pairs are automatically
distributed to the correct reduce node based on the value of the key.
The reduce operation takes the key-value pairs and returns
consolidated values for each key.
It’s the job of the MapReduce framework to get the right
keys to the right reduce nodes.
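The map/shuffle/reduce flow above can be sketched as a single-process Python toy, using word count as the classic example. All function names here are illustrative, not part of any MapReduce framework; a real framework runs the map and reduce functions in parallel across nodes.

```python
# Toy map -> shuffle -> reduce pipeline (word count), single process.
from collections import defaultdict

def map_phase(record):
    # Map: turn input data into a uniform set of key-value pairs.
    return [(word, 1) for word in record.split()]

def shuffle_phase(mapped_pairs):
    # Shuffle: group pairs by key; the framework routes each key
    # to exactly one reduce node.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: consolidate all values for a key into one result.
    return key, sum(values)

records = ["big data big problems", "big data"]
mapped = [pair for r in records for pair in map_phase(r)]
shuffled = shuffle_phase(mapped)
result = dict(reduce_phase(k, v) for k, v in shuffled.items())
print(result)  # {'big': 3, 'data': 2, 'problems': 1}
```

Because each key's values end up on a single reducer, adding more reducers splits the work without changing the result, which is the source of the linear scalability described above.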
7.USING REPLICATION TO SCALE READS
How you can replicate data to speed read performance in NoSQL systems.
All incoming client requests enter from the left. All reads can be directed to any node,
either a primary read/write node or a replica node.
All write transactions can be sent to a central read/write node that will update the data
and then automatically send the updates to replica nodes.
The time between the write to the primary and the time the update arrives on the
replica nodes determines how long it takes for reads to return consistent results.
Since the 1970s, the RDBMS has been the solution for data storage
and maintenance problems.
After the advent of big data, companies realized the benefit of
processing big data and started opting for solutions like Hadoop.
Hadoop uses distributed file system for storing big data, and
MapReduce to process it.
Hadoop excels at storing and processing huge data of various
formats: structured, semi-structured, or even unstructured.
Limitations of Hadoop
Hadoop can perform only batch processing, and data will be
accessed only in a sequential manner. That means one has to
search the entire dataset even for the simplest of jobs.
A huge dataset, when processed, results in another huge dataset,
which must also be processed sequentially. At this point, a new
solution is needed to access any point of data in a single unit of
time (random access).
WHAT IS HBASE?
HBase is an open-source, sorted-map data store built on
Hadoop. It is column oriented and horizontally scalable.
It is based on Google's Big Table.
It has a set of tables which keep data in key-value format.
HBase is well suited for sparse data sets, which are very
common in big data use cases.
HBase provides APIs enabling development in practically
any programming language.
It is a part of the Hadoop ecosystem that provides random,
real-time read/write access to data in the Hadoop File System.
Limitations of RDBMS that motivate HBase:
•RDBMSs get exponentially slow as the data becomes huge.
•They expect data to be highly structured, i.e. able to fit in
a well-defined schema.
•Any change in schema might require downtime.
•For sparse datasets, there is too much overhead of
maintaining NULL values.
FEATURES OF HBASE:
•Horizontally scalable: You can add any number of columns anytime.
•Automatic Failover: Automatic failover is a resource that allows a system administrator
to automatically switch data handling to a standby system in the event of system failure.
•Integrations with Map/Reduce framework: All the commands and Java code internally
implement Map/Reduce to do the task, and it is built over the Hadoop Distributed File System.
•Sparse, distributed, persistent, multidimensional sorted map, which is indexed by row key,
column key and timestamp.
•Often referred to as a key-value store, a column-family-oriented database, or a store of
versioned maps of maps.
•Fundamentally, it's a platform for storing and retrieving data with random access.
•It doesn't care about datatypes (you can store an integer in one row and a string in
another for the same column).
•It doesn't enforce relationships within your data.
•It is designed to run on a cluster of computers, built using commodity hardware.
It is a part of the Hadoop ecosystem that provides random real-time read/write
access to data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase.
Data consumer reads/accesses the data in HDFS randomly using HBase.
HBase sits on top of the Hadoop File System and provides read and write access.
WHERE TO USE HBASE?
•Apache HBase is used to have random, real-time read/write
access to Big Data.
•It hosts very large tables on top of clusters of commodity hardware.
•Apache HBase is a non-relational database modelled after Google's Bigtable.
•Bigtable acts on top of the Google File System; likewise, Apache
HBase works on top of Hadoop and HDFS.
HBase is a column-oriented NoSQL database.
Although it looks similar to a relational database, which contains rows
and columns, it is not a relational database.
Relational databases are row-oriented while HBase is column-oriented.
So, let us first understand the difference between column-oriented
and row-oriented databases:
1. ROW-ORIENTED NOSQL:
•Row-oriented databases store table records in a sequence of rows.
•To better understand it, let us take an example and consider the table below.
•If this table is stored in a row-oriented database, it will store the records as shown below:
1, Paul Walker, US, 231, Gallardo,
2, Vin Diesel, Brazil, 520, Mustang
In row-oriented databases data is stored on the basis of rows or tuples as you can see above.
2. COLUMN-ORIENTED NOSQL:
Column-oriented databases store table records in a sequence of
columns, i.e. the entries in a column are stored in contiguous locations on disk.
In a column-oriented database, all the column values are stored together:
the first column's values are stored together, then the second column's
values are stored together, and data in other columns is stored in a similar way.
The column-oriented databases store this data as:
1,2, Paul Walker, Vin Diesel, US, Brazil, 231, 520, Gallardo, Mustang
When the amount of data is very huge, like in terms of petabytes or exabytes, we
use column-oriented approach, because the data of a single column is stored
together and can be accessed faster.
The row-oriented approach, by comparison, handles a smaller number of rows and
columns efficiently, as a row-oriented database stores data in a structured format.
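A short sketch of the two physical layouts, serializing the example table above both ways (pure Python, illustrative only):

```python
# Serialize the same table row-oriented and column-oriented.
table = [
    (1, "Paul Walker", "US", 231, "Gallardo"),
    (2, "Vin Diesel", "Brazil", 520, "Mustang"),
]

# Row-oriented: record after record, each row contiguous.
row_layout = [value for row in table for value in row]

# Column-oriented: column after column, entries of one column adjacent.
column_layout = [row[i] for i in range(len(table[0])) for row in table]

print(row_layout)
# [1, 'Paul Walker', 'US', 231, 'Gallardo', 2, 'Vin Diesel', 'Brazil', 520, 'Mustang']
print(column_layout)
# [1, 2, 'Paul Walker', 'Vin Diesel', 'US', 'Brazil', 231, 520, 'Gallardo', 'Mustang']
```

Scanning one column (say, all the country values) touches one contiguous run in the column layout, but must skip through every record in the row layout, which is why analytical scans favor the column-oriented approach.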
When we need to process and analyze a large set of semi-structured or
unstructured data, we use column oriented approach. Such as applications dealing
with Online Analytical Processing like data mining, data warehousing, applications
including analytics, etc.
Whereas, Online Transactional Processing such as banking and finance
domains which handle structured data and require transactional properties (ACID
properties) use row-oriented approach.
HBASE TABLES HAVE THE FOLLOWING
COMPONENTS, SHOWN IN THE IMAGE:
•Tables: Data is stored in a table format in HBase. But here tables are in
column-oriented format.
•Row Key: Row keys are used to search records, which makes searches fast.
•Column Families: Various columns are combined in a column family.
These column families are stored together, which makes the searching fast;
data belonging to the same column family can be accessed together in a single seek.
•Column Qualifiers: Each column's name is known as its column qualifier.
•Cell: Data is stored in cells. The data is dumped into cells which are
specifically identified by row key and column qualifiers.
•Timestamp: Timestamp is a combination of date and time. Whenever
data is stored, it is stored with its timestamp. This makes it easy to
search for a particular version of the data.
In a simpler way, we can say HBase consists of:
•A set of tables
•Each table with column families and rows
•Row key acts as a primary key in HBase.
•Any access to HBase tables uses this primary key.
•Each column qualifier present in HBase denotes an attribute
corresponding to the object which resides in the cell.
HBASE ARCHITECTURE AND ITS COMPONENTS
HBase architecture consists mainly of five components:
HMaster, Region Servers, Regions, ZooKeeper, and HDFS.
1) HMASTER
HMaster in HBase is the implementation of a Master server in
HBase architecture.
It acts as a monitoring agent to monitor all Region Server
instances present in the cluster and acts as an interface for all
metadata changes.
In a distributed cluster environment, the Master runs on the NameNode.
The Master runs several background threads.
The following are important roles performed by HMaster
1. HMaster Plays a vital role in terms of performance and
maintaining nodes in the cluster.
2. HMaster provides admin performance and distributes
services to different region servers.
3. HMaster assigns regions to region servers.
4. HMaster has features like controlling load balancing and
failover to handle the load over nodes present in the cluster.
5. When a client wants to change any schema or perform
metadata operations, HMaster takes responsibility for
these operations.
Some of the methods exposed by HMaster Interface are primarily
Metadata oriented methods.
•Table (createTable, removeTable, enable, disable)
•ColumnFamily (addColumn, modifyColumn)
•Region (move, assign)
The client communicates in a bi-directional way with both
HMaster and ZooKeeper.
For read and write operations, it directly contacts HRegion servers.
HMaster assigns regions to region servers and, in turn, checks the
health status of region servers.
In the entire architecture, we have multiple region servers.
HLog, present in the region servers, stores all the log files.
2) HBASE REGION SERVERS
When HBase Region Server receives writes and read requests from the client, it assigns the
request to a specific region, where the actual column family resides.
However, the client can directly contact HRegion servers; HMaster's
permission is not mandatory for the client to communicate with HRegion servers.
The client requires HMaster's help when operations related to metadata and schema
changes are required.
HRegionServer is the Region Server implementation.
It is responsible for serving and managing regions or data that is present in a distributed
cluster.The region servers run on Data Nodes present in the Hadoop cluster.
HMaster can get into contact with multiple HRegion servers and performs the following
functions:
•Hosting and managing regions
•Splitting regions automatically
•Handling read and write requests
A Region Server maintains various regions running on the top of HDFS. Components
of a Region Server are:
•WAL: As you can conclude from the above image, the Write Ahead Log (WAL) is a file
attached to every Region Server inside the distributed environment. The WAL stores
the new data that hasn't been persisted or committed to the permanent storage. It is
used in case of failure to recover the data sets.
•Block Cache: From the above image, it is clearly visible that the Block Cache resides at
the top of the Region Server. It stores the frequently read data in memory. If the data
in the Block Cache is least recently used, then that data is removed from the Block Cache.
•MemStore: It is the write cache. It stores all the incoming data before committing it
to the disk or permanent memory. There is one MemStore for each column family in
a region. As you can see in the image, there are multiple MemStores for a region because
each region contains multiple column families. The data is sorted in lexicographical
order before committing it to the disk.
•HFile: From the above figure you can see that HFiles are stored on HDFS. Thus HFiles
store the actual cells on the disk. The MemStore commits its data to an HFile when the
size of the MemStore exceeds a threshold.
3) HBASE REGIONS
HRegions are the basic building elements of an HBase cluster; they consist of the
distribution of tables and are comprised of column families.
A region contains multiple stores, one for each column family.
Each store consists of mainly two components: MemStore and HFile.
So, concluding in a simpler way:
•A table can be divided into a number of regions. A Region is a sorted range of rows storing data
between a start key and an end key.
•A Region has a default size of 256MB which can be configured according to the need.
•A Group of regions is served to the clients by a Region Server.
•A Region Server can serve approximately 1000 regions to the client.
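The idea of regions as sorted row-key ranges can be sketched as follows. The region boundaries and server names here are made up for illustration; real HBase finds this mapping via the META table.

```python
# Illustrative region lookup: each region owns a sorted row-key range.
import bisect

# Region i covers [region_starts[i], region_starts[i+1]); the last is open-ended.
region_starts = ["", "m", "t"]                        # hypothetical split points
region_servers = ["rs1.example", "rs2.example", "rs1.example"]

def find_region(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(region_starts, row_key) - 1

print(find_region("apple"))                  # 0
print(find_region("pear"))                   # 1
print(find_region("zebra"))                  # 2
print(region_servers[find_region("pear")])   # 'rs2.example'
```

Because the ranges are sorted and non-overlapping, locating the region responsible for any row key is a single binary search, and a range scan between a start key and an end key touches only the regions it overlaps.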
4) ZOOKEEPER
HBase ZooKeeper is a centralized monitoring server
which maintains configuration information and
provides distributed synchronization.
Distributed synchronization means accessing the
distributed applications running across the cluster
with the responsibility of providing coordination
services between nodes.
If a client wants to communicate with regions, the
client has to approach ZooKeeper first.
ZooKeeper is an open-source project, and it provides
many important services.
•Zookeeper acts like a coordinator inside HBase distributed
environment. It helps in maintaining server state inside the
cluster by communicating through sessions.
•Every Region Server along with HMaster Server sends
continuous heartbeat at regular interval to Zookeeper and it
checks which server is alive and available as mentioned in
above image. It also provides server failure notifications so
that, recovery measures can be executed.
•Referring from the above image you can see, there is an
inactive server, which acts as a backup for active server. If
the active server fails, it comes for the rescue.
•The active HMaster sends heartbeats to ZooKeeper while
the inactive HMaster listens for the notifications sent by the active
HMaster. If the active HMaster fails to send a heartbeat, the
session is deleted and the inactive HMaster becomes active.
•If a Region Server fails to send a heartbeat, the session is
expired and all listeners are notified about it. Then HMaster
performs suitable recovery actions.
•Zookeeper also maintains the .META Server’s path, which
helps any client in searching for any region. The Client first has
to check with .META Server in which Region Server a region
belongs, and it gets the path of that Region Server.
•The META table is a special HBase catalog table. It maintains a list of
all the Regions Servers in the HBase storage system, as you can see
in the above image.
•Looking at the figure you can see, .META file maintains the table in
form of keys and values. Key represents the start key of the region and
its id whereas the value contains the path of the Region Server.
Services provided by ZooKeeper
•Maintains Configuration information
•Provides distributed synchronization
•Client Communication establishment with region servers
•Provides ephemeral nodes, which represent different region servers
•Master servers use ephemeral nodes to discover available
servers in the cluster
•Tracks server failures and network partitions
5) HDFS
HDFS is the Hadoop Distributed File System; as the name implies, it provides a
distributed environment for storage, and it is a file system designed
to run on commodity hardware.
It stores each file in multiple blocks and to maintain fault tolerance, the blocks
are replicated across a Hadoop cluster.
HDFS provides a high degree of fault tolerance and runs on cheap commodity hardware.
By adding nodes to the cluster and performing processing & storing using
cheap commodity hardware, it gives the client better results as compared
to the existing one.
HDFS gets in contact with the HBase components and stores a large
amount of data in a distributed manner.
HBase is a column-oriented database and data is stored in tables.
The tables are sorted by RowId. As shown above, HBase has RowId, which is the
collection of several column families that are present in the table.
The column families that are present in the schema are key-value pairs. If we
observe in detail, each column family has multiple columns. The
column values are stored on disk.
Each cell of the table has its own metadata, like the timestamp and other
information.
Coming to HBase, the following are the key terms representing the table schema:
•Table: Collection of rows present.
•Row: Collection of column families.
•Column Family: Collection of columns.
•Column: Collection of key-value pairs.
•Namespace: Logical grouping of tables.
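These terms can be sketched as the sparse, multidimensional sorted map described earlier, using nested Python dicts. The table, family, and qualifier names below are hypothetical, chosen only to mirror the earlier example.

```python
# Illustrative HBase-style data model:
# table[row_key][column_family][qualifier][timestamp] = value
from collections import defaultdict

def make_table():
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def put(table, row, family, qualifier, value, timestamp):
    table[row][family][qualifier][timestamp] = value

def get(table, row, family, qualifier):
    """Return the value with the latest timestamp (HBase's default)."""
    versions = table[row][family][qualifier]
    return versions[max(versions)] if versions else None

table = make_table()
put(table, "row1", "personal", "name", "Paul Walker", timestamp=100)
put(table, "row1", "personal", "name", "P. Walker", timestamp=200)
put(table, "row1", "car", "model", "Gallardo", timestamp=100)

print(get(table, "row1", "personal", "name"))  # 'P. Walker' (latest version)
print(get(table, "row1", "car", "model"))      # 'Gallardo'
```

Note how rows that never set a given column simply have no entry at all, which is why sparse datasets carry no NULL-storage overhead in this model.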
HOW ARE REQUESTS HANDLED
IN HBASE ARCHITECTURE?
Three mechanisms are followed to handle the
requests in the HBase architecture:
1. Commence the Search in HBase Architecture
2. Write Mechanism in HBase Architecture
3. Read Mechanism in HBase Architecture
1. COMMENCE THE SEARCH IN HBASE ARCHITECTURE
The steps to initialize the search are:
1.The user retrieves the Meta table from
ZooKeeper and then requests for the location
of the relevant Region Server.
2.Then the user will request the exact data
from the Region Server with the help of the row key.
2. WRITE MECHANISM IN HBASE ARCHITECTURE
The write mechanism goes through the following process
sequentially (refer to the above image):
•Whenever the client has a write request, the client writes the data to
the WAL (Write Ahead Log).
•The edits are then appended at the end of the WAL file.
•This WAL file is maintained in every Region Server, and the Region Server
uses it to recover data which is not committed to the disk.
•Once data is written to the WAL, it is copied to the MemStore.
•Once the data is placed in the MemStore, the client receives the
acknowledgment.
•When the MemStore reaches the threshold, it dumps or commits the
data into an HFile.
HBase Write Mechanism - MemStore
•The MemStore always updates the data stored in it in
lexicographical order (sequentially, in a dictionary manner) as sorted
key-values. There is one MemStore for each column family, and thus
the updates are stored in a sorted manner for each column family.
•When the MemStore reaches the threshold, it dumps all the data into
a new HFile in a sorted manner. This HFile is stored in HDFS. HBase
contains multiple HFiles for each column family.
•Over time, the number of HFiles grows as the MemStore dumps data.
•The MemStore also saves the last written sequence number, so the
Master Server and MemStore both know what has been committed so far
and where to start from. When a region starts up, the last sequence
number is read, and from that number, new edits start.
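A toy version of this write path, with an illustrative MemStore threshold and in-memory stand-ins for the WAL and HFiles, might look like:

```python
# Illustrative write path: WAL append -> MemStore -> flush to sorted HFile.
class RegionStore:
    def __init__(self, memstore_threshold=3):
        self.wal = []          # write-ahead log: replayed after a crash
        self.memstore = {}     # in-memory write cache
        self.hfiles = []       # immutable sorted files (on HDFS in reality)
        self.threshold = memstore_threshold

    def put(self, key, value):
        self.wal.append((key, value))      # 1. durably log the edit first
        self.memstore[key] = value         # 2. then apply it to the MemStore
        if len(self.memstore) >= self.threshold:
            self.flush()                   # 3. spill to a new HFile if full

    def flush(self):
        # Dump the MemStore contents as one sorted, immutable HFile.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore.clear()

store = RegionStore()
for k, v in [("c", 3), ("a", 1), ("b", 2), ("d", 4)]:
    store.put(k, v)

print(store.hfiles)    # [[('a', 1), ('b', 2), ('c', 3)]]
print(store.memstore)  # {'d': 4} -- still in the write cache
```

Logging to the WAL before touching the MemStore is what lets a restarted Region Server replay un-flushed edits, and flushing as a sorted batch is what keeps each HFile in lexicographical order.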
HBase Architecture: HBase Write Mechanism - HFile
•The writes are placed sequentially on the disk. Therefore, the
movement of the disk's read-write head is very small. This makes the
write and search mechanism very fast.
•The HFile indexes are loaded in memory whenever an HFile is
opened. This helps in finding a record in a single seek.
•The trailer is a pointer which points to the HFile's meta block. It
is written at the end of the committed file. It contains
information about the timestamps and bloom filters.
•A Bloom Filter helps in searching key-value pairs: it skips the file
which does not contain the required row key. The timestamp also
helps in searching a version of the file.
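A minimal Bloom-filter sketch shows how a read can skip an HFile that definitely does not contain a row key. The sizes and hash scheme here are illustrative; real HBase Bloom filters are more sophisticated.

```python
# Illustrative Bloom filter: set membership with no false negatives.
import hashlib

class BloomFilter:
    def __init__(self, size=64, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
for row_key in ["row1", "row2", "row3"]:
    bf.add(row_key)

print(bf.might_contain("row2"))    # True
print(bf.might_contain("row999"))  # almost certainly False -> skip this HFile
```

A "definitely absent" answer lets the read path skip opening that HFile entirely, at the cost of a small false-positive rate that only causes an occasional unnecessary read.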
3. READ MECHANISM IN HBASE ARCHITECTURE
To read any data, the user will first have to access the
relevant Region Server. Once the Region Server is known,
the process includes:
1.The first scan is made at the read cache, which is the Block Cache.
2.The next scan location is the MemStore, which is the write cache.
3.If the data is not found in the Block Cache or MemStore, the
scanner will retrieve the data from the HFiles.
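The lookup order can be sketched like this. Plain dicts stand in for the real components, and this simplification ignores version merging, which real HBase performs across all three sources.

```python
# Illustrative read path: BlockCache -> MemStore -> HFiles on disk.
def read(row_key, block_cache, memstore, hfiles):
    if row_key in block_cache:          # 1. read cache
        return block_cache[row_key]
    if row_key in memstore:             # 2. write cache
        return memstore[row_key]
    for hfile in hfiles:                # 3. fall back to disk
        if row_key in hfile:
            block_cache[row_key] = hfile[row_key]  # cache for next time
            return hfile[row_key]
    return None

block_cache = {"row1": "cached-value"}
memstore = {"row2": "fresh-value"}
hfiles = [{"row3": "disk-value"}]

print(read("row1", block_cache, memstore, hfiles))  # 'cached-value'
print(read("row2", block_cache, memstore, hfiles))  # 'fresh-value'
print(read("row3", block_cache, memstore, hfiles))  # 'disk-value'
```

Each step is cheaper than the next, so most reads are satisfied from memory and only cache misses pay the cost of scanning HFiles on disk.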