The document provides an introduction and overview of NoSQL databases. It discusses why NoSQL databases were created and the different categories of NoSQL databases, including column stores, document stores, and key-value stores. It also provides an overview of Hadoop, describing it as a framework that allows distributed processing of large datasets across computer clusters.
4. Overview of NoSQL (Contd…)
NoSQL does not mean that SQL is abandoned or will never be used.
The term refers to databases that differ from relational databases.
Put simply: non-relational databases.
A NoSQL database is a non-relational database management system that differs from
traditional relational database management systems in some significant ways.
It is designed for distributed data stores with very large-scale storage
needs (for example Google or Facebook, which collect terabits of data every
day for their users). This type of data store may not require a fixed schema,
avoids join operations, and typically scales horizontally.
5. Overview of NoSQL (Contd…)
Many NoSQL databases favor eventual consistency under the CAP trade-off rather than full ACID guarantees.
CAP theorem:
A distributed data store can guarantee at most two of the following three properties at the same time.
Consistency - The data in the database remains consistent
after the execution of an operation. For example, after an update operation all
clients see the same data.
Availability - The system is always on (guaranteed service
availability), with no downtime.
Node failures do not prevent the surviving nodes from continuing to operate.
Partition Tolerance - The system continues to function even if
the communication among the servers is unreliable, i.e. the servers may be
partitioned into multiple groups that cannot communicate with one another.
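The trade-off described above can be illustrated with a small sketch. It is not how any particular NoSQL product works; the `Replica` class, the `sync` helper, and the last-write-wins rule based on timestamps are assumptions made for illustration. Two replicas keep accepting writes during a partition (availability), disagree for a while, and agree again once they can exchange data (eventual consistency).

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One node holding a copy of the data as key -> (timestamp, value)."""
    name: str
    data: dict = field(default_factory=dict)

    def write(self, key, value, timestamp):
        # Keep the write only if it is newer than what this replica holds.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions; the newest timestamp wins."""
    for src, dst in ((a, b), (b, a)):
        for key, (timestamp, value) in list(src.data.items()):
            dst.write(key, value, timestamp)


# Two replicas start in agreement.
node_a, node_b = Replica("a"), Replica("b")
node_a.write("status", "online", timestamp=1)
sync(node_a, node_b)

# Network partition: each side keeps accepting writes (availability),
# so the replicas temporarily disagree (consistency is given up).
node_a.write("status", "busy", timestamp=2)
node_b.write("status", "away", timestamp=3)
diverged = node_a.read("status") != node_b.read("status")

# The partition heals and the replicas reconcile: eventual consistency.
sync(node_a, node_b)
converged = node_a.read("status") == node_b.read("status")
```

Last-write-wins is only one possible reconciliation rule; it silently drops the older of two conflicting writes, which is why it is labeled here as an assumption rather than a recommendation.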
6. Overview of NoSQL (Contd…)
NoSQL Features:
1. Scalability
Lets the system maintain performance as load grows.
Horizontal Scalability:
Increasing the number of machines while maintaining proportional
performance.
Vertical scalability:
Adding more resources to a single machine to improve
performance.
2. Open Source
Most NoSQL projects are open source, so anyone can use and modify
them, for example:
Cassandra, originally developed at Facebook.
Bigtable, by Google, is an exception: it is proprietary and used only for Google applications.
7. Overview of NoSQL (Contd…)
3. Schema Freeness
NoSQL databases do not use a fixed schema the way relational databases do
(internal schema, external schema, etc.).
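Schema freeness can be pictured with plain Python dictionaries. This is a minimal sketch, not the storage format of any specific database; the `users` records and the `field_names` and `find` helpers are hypothetical names used only for illustration. Each record carries its own fields, and a new field can be added to one record without altering the others.

```python
# Hypothetical records: there is no table definition, and each record
# carries whatever fields it needs.
users = [
    {"id": 1, "name": "Asha"},
    {"id": 2, "name": "Ben", "email": "ben@example.com"},
    {"id": 3, "name": "Chen", "location": {"lat": 40.7, "lon": -74.0}},
]


def field_names(records):
    """Collect every field that appears in at least one record."""
    names = set()
    for record in records:
        names.update(record)
    return sorted(names)


def find(records, **criteria):
    """Return records matching all criteria; a missing field never matches."""
    missing = object()
    return [
        record
        for record in records
        if all(record.get(key, missing) == value for key, value in criteria.items())
    ]


# A new attribute is added to one record; no migration of the others is needed.
users[0]["tags"] = ["admin"]

all_fields = field_names(users)
with_email = find(users, email="ben@example.com")
```

The flip side of this flexibility is that the application, not the database, has to cope with records that lack a field, which is what the `missing` sentinel in `find` does.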
NoSQL was originally intended for modern web-scale databases.
A large number of companies use NoSQL. To name a few:
• Google
• Facebook
• Mozilla
• Adobe
• Foursquare
• LinkedIn
• Digg
• McGraw-Hill Education
8. WHY NOSQL?
Benefits of NOSQL:
1. Scaling
Relational databases (RDBs) were not easy to scale out.
NoSQL DBs, on the other hand, are specifically designed to scale out.
2. Big data
A single RDBMS can hardly handle today’s huge volumes of data and
the transactions on that data,
whereas non-relational databases are specifically designed to handle big data.
Data is becoming easier to capture and access through third parties such as
Facebook, D&B, and others. Personal user information, geo location data,
social graphs, user-generated content, machine logging data, and sensor-
generated data are just a few examples of the ever-expanding array of data
being captured.
3. No need for expert DBAs
Although RDBMS vendors claim that their products provide management facilities,
an expert DBA is still needed to operate them.
In contrast, NoSQL DBs generally do not need expert DBAs, as they provide automatic
repair, data distribution, and simpler data models, which lead to lower
administration overhead.
9. WHY NOSQL? (CONTD…)
4. Economics
An RDBMS requires expensive components to provide efficient service.
NoSQL uses cheap commodity servers to manage the same amount of
data for which an RDBMS needs an expensive server, so NoSQL is economical
as well.
5. Flexibility of data models
An organization’s requirements can change with the
passage of time. Changing an RDBMS after its deployment creates
many problems and affects its services; sometimes it is almost
impossible to make the change. A NoSQL database can be changed at
any time, i.e. existing columns can be altered and new ones can be added.
10. WHY NOSQL? (CONTD…)
Scale up with relational technology: limitations at the database tier
Source: http://www.couchbase.com/why-nosql/nosql-database
11. WHY NOSQL? (CONTD…)
Scale out with NoSQL technology at the database tier
Source: http://www.couchbase.com/why-nosql/nosql-database
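Scaling out at the database tier means spreading the data over more machines. The sketch below shows one simple way to do that, hash partitioning with modulo placement. The node names, the `shard_for` and `distribute` helpers, and the choice of SHA-256 are assumptions for illustration, not the scheme used by Couchbase or any other product.

```python
import hashlib


def shard_for(key, nodes):
    """Pick a node from a stable hash of the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes):
    """Group keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[shard_for(key, nodes)].append(key)
    return placement


keys = [f"user:{i}" for i in range(1000)]

# Doubling the number of nodes roughly halves the data each node holds.
two_nodes = distribute(keys, ["node-1", "node-2"])
four_nodes = distribute(keys, ["node-1", "node-2", "node-3", "node-4"])

largest_with_two = max(len(v) for v in two_nodes.values())
largest_with_four = max(len(v) for v in four_nodes.values())
```

Modulo placement is the easiest scheme to read, but it relocates most keys whenever the node count changes; distributed stores often use consistent hashing to limit that movement.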
14. CATEGORIES OF NOSQL DATABASES
There is a variety of types:
• Column Store – Each storage block contains data from only one column
• Document Store – stores documents made up of tagged elements
• Key-Value Store – Hash table of keys
1. Column Store
• Each storage block contains data from only one column
• Example: Hadoop/HBase
http://hadoop.apache.org/
Clients : Yahoo, Facebook
• Example: Ingres VectorWise
Column Store integrated with an SQL database
• More efficient than a row (or document) store if:
Multiple rows/records/documents are inserted at the same time, so updates of
column blocks can be aggregated
Retrievals access only some of the columns in a row/record/document
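The difference between row and column layout can be sketched in a few lines. This is an in-memory illustration only, assuming every row has the same fields; the sample `rows` and the `to_columns` helper are hypothetical and do not reflect how HBase or VectorWise lay out data on disk.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Pune", "amount": 120},
    {"id": 2, "city": "Oslo", "amount": 80},
    {"id": 3, "city": "Lima", "amount": 200},
]


def to_columns(records):
    """Regroup records so that each column's values are stored together."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column touches only that column's block;
# the "id" and "city" values are never read.
total_amount = sum(columns["amount"])
```

With the row layout the same total would require visiting every record in full, which is the retrieval case the slide describes.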
15. CATEGORIES OF NOSQL DATABASES (CONTD…)
2. Document Store:
• It stores documents made up of tagged elements.
• Example: CouchDB
http://couchdb.apache.org/
Clients - BBC
• Example: MongoDB
http://www.mongodb.org/
Clients - Foursquare, Shutterfly
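A document made up of tagged elements can be pictured as nested JSON. The sketch below uses only Python's standard `json` module; the sample `document` and the `get_path` helper with its dotted-path syntax are hypothetical and are not the query API of CouchDB or MongoDB.

```python
import json

# A hypothetical document: tagged (named) elements, some of them nested.
document = {
    "_id": "post-1",
    "title": "Why NoSQL?",
    "author": {"name": "Asha", "city": "Pune"},
    "tags": ["nosql", "databases"],
    "comments": [{"user": "ben", "text": "Nice overview"}],
}


def get_path(doc, path, default=None):
    """Follow a dotted path such as 'author.city' or 'comments.0.user'."""
    current = doc
    for part in path.split("."):
        if isinstance(current, list):
            # List elements are addressed by their numeric position.
            if not part.isdigit() or int(part) >= len(current):
                return default
            current = current[int(part)]
        elif isinstance(current, dict) and part in current:
            current = current[part]
        else:
            return default
    return current


# The whole document is stored and retrieved as one unit.
stored = json.dumps(document)
loaded = json.loads(stored)

author_city = get_path(loaded, "author.city")
first_commenter = get_path(loaded, "comments.0.user")
```

Because the author and comments travel inside the document, reading the post needs no join against other tables.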
16. CATEGORIES OF NOSQL DATABASES (CONTD…)
3. Key-Value Store:
• Hash table of keys
• Values stored with Keys
• Fast access to small data values
• Example – Project-Voldemort
http://www.project-voldemort.com/
Clients : LinkedIn
• Example – MemCacheDB
http://memcachedb.org/
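The key-value model is essentially a hash table with a very small interface. The sketch below wraps a Python `dict`; the `KeyValueStore` class and its `put`, `get`, and `delete` methods are hypothetical names for illustration and are not the client API of Project Voldemort or MemcacheDB.

```python
class KeyValueStore:
    """Minimal in-memory key-value store backed by a dict (a hash table)."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        # The store treats the value as opaque; it never looks inside it.
        self._table[key] = value

    def get(self, key, default=None):
        return self._table.get(key, default)

    def delete(self, key):
        """Remove a key and report whether it was present."""
        if key in self._table:
            del self._table[key]
            return True
        return False


store = KeyValueStore()
store.put("session:42", b"user=asha;theme=dark")
store.put("counter:visits", 17)

session = store.get("session:42")
missing = store.get("session:99", default="not found")
removed = store.delete("counter:visits")
```

Lookups go straight from key to value, which is why this model suits fast access to small values; the price is that there is no way to query by the contents of a value.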
17. HADOOP - OVERVIEW
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple
programming models.
It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is designed
to detect and handle failures at the application layer, so delivering a highly-available
service on top of a cluster of computers, each of which may be prone to failures.
The Apache Hadoop framework is composed of the following modules :
Hadoop Common - contains libraries and utilities needed by other Hadoop modules
Hadoop Distributed File System (HDFS) - a distributed file system that stores data
on commodity machines, providing very high aggregate bandwidth across the
cluster.
Hadoop YARN - a resource-management platform responsible for managing
compute resources in clusters and using them for scheduling of users' applications.
Hadoop MapReduce - a programming model for large-scale data processing.
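The MapReduce programming model can be shown with a word count in plain Python. This sketch runs in a single process and does not use the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions for illustration. In Hadoop the same three steps run in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big clusters", "data on clusters"])
```

Each call to `map_phase` depends only on its own line and each call to `reduce_phase` only on its own key, which is what lets a framework distribute the work over many machines.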