2. Agenda
Introduction to NOSQL
Objective
Examples of NOSQL databases
NOSQL vs SQL
Conclusion
3. Basic Concepts
Database - an organized collection of data.
Database Management System (DBMS) - a software
package that controls the
creation, maintenance & use of a database.
We interact with a DBMS through a structured language.
Ex. Oracle, IBM DB2, MS Access, MySQL, FoxPro etc.
Relational DBMS - A relational database is a
collection of data items organized as a set of formally
described tables from which data can be accessed easily.
A relational database is created using the relational
model. The software used in a relational database is
called a relational database management
system (RDBMS).
4. SQL
Structured Query Language
Special purpose programming language designed for
managing data in RDBMS.
Originally based on relational algebra & tuple relational
calculus.
SQL’s scope includes data insert, update & delete, schema
creation and modification, data access control.
It is statically and strongly typed.
The most widely used database language.
Query is the most important operation in SQL.
Ex. SELECT *
FROM Book
WHERE price > 100.00
ORDER BY title;
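The query above can be tried out with Python's built-in sqlite3 module. This is only a sketch: the Book table layout and the three sample rows are made up for illustration and are not part of the original slides.

```python
import sqlite3

# In-memory database; the Book schema and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Book (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO Book (title, price) VALUES (?, ?)",
    [
        ("SQL Basics", 120.00),
        ("Cheap Reads", 45.50),
        ("Advanced Databases", 250.00),
    ],
)

# The query from the slide: books priced over 100.00, sorted by title.
rows = conn.execute(
    "SELECT * FROM Book WHERE price > 100.00 ORDER BY title"
).fetchall()
conn.close()

for title, price in rows:
    print(title, price)
```

Only the two books priced above 100.00 come back, ordered alphabetically by title.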
5. NOSQL
Stands for Not Only SQL
Class of non-relational data storage systems
Usually do not require a fixed table schema nor do
they use the concept of joins
Most NOSQL offerings relax one or more of the ACID
properties:
Atomicity, Consistency, Isolation, Durability (ACID)
“NOSQL” = “Not Only SQL” =
Not Only using traditional relational DBMS
6. NOSQL
• Alternative to traditional relational DBMS
• Flexible schema
• Quicker/cheaper to set up
• Massive scalability
• Relaxed consistency -> higher performance &
availability
* No declarative query language -> more programming
* Relaxed consistency -> fewer guarantees
7. Why NOSQL?
Not every problem can be solved by a traditional
relational database system alone.
Handles huge databases.
Redundancy: data is pretty safe even on commodity
hardware
Super flexible queries using map/reduce
Rapid development (no fixed schema, yeah!)
Very fast for common use cases
8. Contd..
Inspired by Distributed Data Storage problems
Scale easily by adding servers
Not suited to all problem types, but super-suited to
certain large problem types
High-write situations (e.g. activity tracking or timeline
rendering for millions of users)
Many relational workloads are really simple (e.g.
fetch by primary key, then update)
10. How does it work?
Clients know how to:
Send items to servers (consistent hashing)
React when a server fails
Fetch keys from servers
“Weight” servers by their capacity
Servers know how to:
Store items they receive
Expire them from the cache
No inter-server comms - servers are unaware of each other
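The client-side logic on this slide can be sketched as a small consistent-hash ring. This is a toy illustration, not the code of any real client library: the class name HashRing, the server names, the choice of MD5 and the 100 points per unit of weight are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring (hypothetical; for illustration only)."""

    def __init__(self, points_per_weight=100):
        self.points_per_weight = points_per_weight
        self._hashes = []  # sorted positions on the ring
        self._owner = {}   # ring position -> server name

    @staticmethod
    def _hash(text):
        return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)

    def add_server(self, name, weight=1):
        # A heavier server gets more points, so it owns more of the key space.
        for i in range(self.points_per_weight * weight):
            h = self._hash(f"{name}#{i}")
            if h in self._owner:  # extremely unlikely collision: skip the point
                continue
            self._owner[h] = name
            bisect.insort(self._hashes, h)

    def remove_server(self, name):
        # When a server fails, drop its points; its keys fall to the next point.
        self._hashes = [h for h in self._hashes if self._owner[h] != name]
        self._owner = {h: s for h, s in self._owner.items() if s != name}

    def server_for(self, key):
        if not self._hashes:
            raise LookupError("no servers in the ring")
        # First point clockwise from the key's hash, wrapping around at the end.
        i = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._owner[self._hashes[i]]


ring = HashRing()
for name in ("cache-a", "cache-b", "cache-c"):
    ring.add_server(name)

keys = [f"user:{n}" for n in range(1000)]
before = {k: ring.server_for(k) for k in keys}

ring.remove_server("cache-b")  # simulate a server failure
after = {k: ring.server_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys changed server")
```

The point of the design is that only the keys that lived on the failed server are reassigned; keys on the surviving servers stay where they were. Passing a larger weight to add_server gives that server more points on the ring and so, on average, a larger share of the keys.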
11. Performance
An RDBMS uses buffers to ensure ACID properties
NoSQL does not guarantee ACID and is therefore
often much faster
We don’t need ACID everywhere!
Ex. Data processing (every minute) is 4x faster with
MongoDB, despite being a lot more detailed (due to
much simpler development)
12. Why is NOSQL faster than SQL? - Scaling
Simple web application with not much traffic
Application server, database server all on one machine
13. Scaling contd..
More traffic comes in
Application server
Database server
Even more traffic comes in
Load balancer
Application server x2
Database server
14. Scaling contd..
Even more traffic comes in
Load balancer x N (easy)
Application server x N (easy)
Database server x N (hard for SQL databases)
16. Scaling contd..
NoSQL Scaling -
Need more storage?
Add more servers!
Need higher performance?
Add more servers!
Need better reliability?
Add more servers!
17. Scaling Summary
You can scale SQL databases (Oracle, MySQL, SQL
Server…)
This will cost you dearly
If you don’t have a lot of money, you will reach limits quickly
You can scale NoSQL databases
Very easy horizontal scaling
Lots of open-source solutions
Scaling is one of the basic design goals, so it is
handled well
Scaling forces trade-offs, such as having to use
map/reduce for queries
18. Characteristics
Almost infinite horizontal scaling
Very fast
Performance doesn’t deteriorate with growth (much)
No fixed table schemas
No join operations
Ad-hoc queries difficult or impossible
Structured storage
Almost everything happens in RAM
19. NOSQL Types
Wide Column Store / Column Families
Document Store
Key Value / Tuple Store
Graph Databases
Object Databases
XML Databases
Multivalue Databases
21. Key Value Stores
Lineage: Amazon's Dynamo paper and Distributed
Hash Tables.
Data model: A global collection of key-value pairs
Example systems
Google BigTable , Amazon Dynamo, Cassandra,
Voldemort, HBase, …
Implementation: efficiency, scalability, fault-tolerance
Records distributed to nodes based on key
Replication
Single-record transactions, “eventual consistency”
22. Document Databases
Lineage: Inspired by Lotus Notes.
Data model: Collections of "documents", each of which
is itself a collection of key-value pairs.
Example: CouchDB, MongoDB, Riak
23. Graph Database
Lineage: Draws from Euler and graph theory.
Data model: Nodes & relationships, both of which can
hold key-value pairs
Example: AllegroGraph, InfoGrid, Neo4j
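The data model on this slide (nodes and relationships that both carry key-value pairs) can be sketched in a few lines. The classes, the KNOWS relationship type and the sample people are hypothetical and are not the API of any of the products listed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: int
    end: int
    rel_type: str
    properties: dict = field(default_factory=dict)


class Graph:
    """Toy property graph (hypothetical; for illustration only)."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def relate(self, start, end, rel_type, **properties):
        self.relationships.append(Relationship(start, end, rel_type, properties))

    def neighbours(self, node_id, rel_type):
        # Follow outgoing relationships of one type from the given node.
        return [
            self.nodes[r.end]
            for r in self.relationships
            if r.start == node_id and r.rel_type == rel_type
        ]


g = Graph()
g.add_node(1, name="Alice")
g.add_node(2, name="Bob")
g.add_node(3, name="Carol")
g.relate(1, 2, "KNOWS", since=2009)  # the relationship holds its own key-value pair
g.relate(1, 3, "KNOWS", since=2011)

friends = [n.properties["name"] for n in g.neighbours(1, "KNOWS")]
print(friends)
```

A real graph database would index relationships per node instead of scanning one list, but the shape of the data is the same.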
24. Map Reduce Framework
Google’s framework for processing highly
distributable problems across huge datasets
using a large number of computers
Let’s define “a large number of computers”:
Cluster if all of them have the same hardware
Grid otherwise (if !Cluster for old-style programmers)
Process split into two phases
Map
Take the input, partition it and delegate the parts to other machines
Other machines can repeat the process, leading to a tree structure
Each machine returns results to the machine that gave it the task
25. Map Reduce Framework contd..
Reduce
Collect results from the machines you gave tasks to
Combine the results and return them to the requester
Slower than sequential data processing, but massively parallel
Sort a petabyte of data in a few hours
Input, Map, Shuffle, Reduce, Output
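The five stages can be sketched on a single machine with a word count, the usual introductory example. This is an illustration of the idea only, not Google's or Hadoop's implementation; the function names and sample sentences are made up, and on a real cluster each input split would be mapped on a different machine.

```python
from collections import defaultdict


def map_phase(document):
    # Map: emit one (word, 1) pair per word in the input split.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Shuffle: group all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine the values for one key into a single result.
    return key, sum(values)


def map_reduce(documents):
    mapped = []
    for doc in documents:  # each split would go to a separate worker
        mapped.extend(map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = map_reduce(["the quick brown fox", "the lazy dog", "The fox"])
print(counts)
```

Because each map call looks at only its own split and each reduce call at only one key, both phases can be spread over many machines without coordination.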
27. Real World Use
Cassandra
Facebook (original developer, used it till late 2010)
Twitter
Digg
Reddit
Rackspace
Cisco
BigTable
Google (HBase is an open-source system modeled after it)
MongoDB
Foursquare
Craigslist
Bit.ly
SourceForge
GitHub
28. MONGODB
Document store
Basic support for dynamic (ad hoc) queries
Query by example (nice!)
Conditional Operators
<, <=, >, >=
$all, $exists, $mod, $ne, $in, $nin, $nor, $or, $and, $size, $type
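Query by example means the query is itself a document: a plain value asks for equality, and a nested document holds operators. In MongoDB the comparisons <, <=, >, >= are spelled $lt, $lte, $gt, $gte. The matcher below is a toy written for illustration, not MongoDB's implementation; it covers only a handful of operators and ignores array matching, nested fields and the other operators listed above. The function name matches and the sample books are hypothetical.

```python
import operator

# Subset of operators supported by this toy matcher.
COMPARISONS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, choices: value in choices,
}


def matches(document, query):
    """Return True if the document satisfies every condition in the query."""
    for field_name, condition in query.items():
        present = field_name in document
        value = document.get(field_name)
        if isinstance(condition, dict):
            # A nested document holds operators, e.g. {"$gt": 100}.
            for op, operand in condition.items():
                if op == "$exists":
                    if present != operand:
                        return False
                elif not present or not COMPARISONS[op](value, operand):
                    return False
        elif not present or value != condition:
            # A plain value means equality: this is the "example" to match.
            return False
    return True


books = [
    {"title": "SQL Basics", "price": 120.0, "tags": ["sql"]},
    {"title": "Cheap Reads", "price": 45.5},
    {"title": "Advanced Databases", "price": 250.0, "tags": ["sql", "nosql"]},
]


def find(query):
    return [b["title"] for b in books if matches(b, query)]


by_example = find({"title": "Cheap Reads"})
expensive = find({"price": {"$gt": 100}})
mid_range = find({"price": {"$gte": 100, "$lt": 200}})
untagged = find({"tags": {"$exists": False}})
print(by_example, expensive, mid_range, untagged)
```

Note that the second book has no tags field at all; with no fixed schema, the $exists operator is how a query tells such documents apart.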
29. MONGODB
Data is stored as BSON (binary JSON)
Makes it very well suited for languages with native JSON support
Map/Reduce written in JavaScript
Slow! There is one single thread of execution in JavaScript
Master/slave replication (auto failover with replica sets)
Sharding built-in
Uses memory mapped files for data storage
Performance over features
On 32-bit systems, limited to ~2.5 GB
An empty database takes up 192 MB
GridFS to store big data + metadata (not actually an FS)
30. CASSANDRA
Written in: Java
Protocol: Custom, binary (Thrift)
Tunable trade-offs for distribution and replication
(N, R, W)
Querying by column, range of keys
BigTable-like features: columns, column families
Writes are much faster than reads (!)
Constant write time regardless of database size
Map/reduce possible with Apache Hadoop
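In Dynamo-style systems, (N, R, W) usually stands for the number of replicas of each item, the number of replicas that must answer a read, and the number that must acknowledge a write. The sketch below, which assumes that reading of the slide, checks the common rule of thumb that a read is guaranteed to see the latest write when R + W > N, by brute force over every possible read set and write set. The function names are made up for the example.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    # Rule of thumb: read and write sets must share a replica when R + W > N.
    return r + w > n


def always_intersect(n, r, w):
    # Brute-force check: every possible read set meets every possible write set.
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


for n, r, w in [(3, 2, 2), (3, 1, 3), (3, 1, 1)]:
    print(n, r, w, quorums_overlap(n, r, w), always_intersect(n, r, w))
```

Lower R or W means fewer replicas to wait for and so faster operations, at the cost of possibly reading stale data; that is the tunable trade-off.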
31. Some more info about Cassandra in Facebook
Cassandra is an open-source DBMS from the Apache
Software Foundation.
Cassandra provides a structured key-value
store with tunable consistency
Cassandra is a distributed storage system for
managing structured data that is designed to scale to
a very large size across many commodity
servers, with no single point of failure
It is a NoSQL solution that was initially developed
by Facebook and powered their Inbox Search feature
until late 2010
32. HBASE
Written in: Java
Main point: Billions of rows X millions of columns
Modeled after BigTable
Map/reduce with Hadoop
Query predicate push down via server side scan and get filters
Optimizations for real time queries
A high performance Thrift gateway
HTTP supports XML, Protobuf, and binary
Cascading, Hive, and Pig source and sink modules
No single point of failure
While Hadoop streams data efficiently, it has overhead for
starting map/reduce jobs. HBase is a column-oriented
key/value store and allows for low-latency reads and writes.
Random access performance is like MySQL
33. COUCHDB
Written in: Erlang
Main point: DB consistency, ease of use
Bi-directional (!) replication, continuous or ad-hoc, with conflict
detection, thus, master-master replication. (!)
MVCC - write operations do not block reads
Previous versions of documents are available
Crash-only (reliable) design
Needs compacting from time to time
Views: embedded map/reduce
Formatting views: lists & shows
Server-side document validation possible
Authentication possible
Real-time updates via _changes (!)
Attachment handling
CouchApps (standalone JS apps)
34. HADOOP
Apache project
A framework that allows for the distributed processing of
large data sets across clusters of computers
Designed to scale up from single servers to thousands of
machines
Designed to detect and handle failures at the application
layer, instead of relying on hardware for it
Created by Doug Cutting, who named it after his son's toy
elephant
Hadoop-related projects
Cassandra
HBase
Pig
Hive was a Hadoop subproject, but is now a top-level Apache project
35. HADOOP contd..
Scales to hundreds or thousands of computers, each with several
processor cores
Designed to efficiently distribute large amounts of work across a
set of machines
Hundreds of gigabytes of data constitute the low end of Hadoop-
scale
Built to process "web-scale" data on the order of hundreds of
gigabytes to terabytes or petabytes
Uses Java, but allows streaming so other languages can easily
send and accept data items to/from Hadoop
36. HADOOP contd..
Uses distributed file system (HDFS)
Designed to hold very large amounts of data (terabytes or even
petabytes)
Files are stored in a redundant fashion across multiple
machines to ensure their durability to failure and high
availability to very parallel applications
Data organized into directories and files
Files are divided into blocks (64 MB by default) and distributed
across nodes
Design of HDFS is based on the design of the Google
File System
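The block arithmetic can be sketched as follows. The 64 MB block size is the default quoted above; the replication factor of 3, the node names and the round-robin placement are assumptions made for the example (real HDFS placement also takes racks into account).

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return an (offset, length) pair for each block of a file."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_blocks(num_blocks, nodes, replication=3):
    """Hypothetical round-robin placement of each block on distinct nodes."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


MB = 1024 * 1024
blocks = split_into_blocks(200 * MB)
placement = place_blocks(len(blocks), ["node1", "node2", "node3", "node4"])

for block_id, (offset, length) in enumerate(blocks):
    print(block_id, offset // MB, length // MB, placement[block_id])
```

A 200 MB file becomes three full 64 MB blocks plus one 8 MB block, and losing any single node still leaves copies of every block on other nodes.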
37. HIVE
A petabyte-scale data warehouse system for Hadoop
Easy data summarization, ad-hoc queries
Query the data using a SQL-like language called
HiveQL
Hive compiler generates map-reduce jobs for most
queries
38. Conclusion
NoSQL is a great problem solver if you need it
Choose your NoSQL platform carefully as each is
designed for specific purpose
Get used to Map/Reduce
It’s not a sin to use NoSQL alongside a (yes)SQL
database