A perspective on the rise of NoSQL systems and a comparison between RDBMS and NoSQL technologies.
The basic idea of the presentation originated while trying to understand the different alternatives available for managing data while building a fast, highly scalable, available, and reliable enterprise application.
2. RDBMS – The origins
Concepts, Architecture and Principles
Golden Age – Way of life.
Changing Times– New Problems, New Needs
Attack on the citadel - Revisiting the norms
Ignited Minds – Working towards NoSQL Solutions
Way Ahead – It is Cloudy out there
3. Girish Narasimha Raghavan
Over 15 years of experience building distributed, large
scale and highly available enterprise systems.
Current interests include building SAC (Social, Big Data
Analytics, and Cloud) solutions.
Likes to write about and discuss technologies and their
applications to solve real world problems.
http://randomtechthought.blogspot.com
4.
5. In the world data abounds. Always has and always will.
Record keeping is as old as the human race.
Constant quest to improve storing, accessing, and analyzing
records
The early machines had serious shortcomings.
Only a very limited amount of program code and data could be stored
in memory.
Electromagnetic data storage was feasible only at an extremely high
cost.
Storing Data was an issue
Organizations had to store data – related to Administration,
Research, Operations.
Data stored in proprietary formats – Database Systems did not exist
Plagued by data integrity issues
Non standard application logic for accessing stored data
6. First attempt: File based systems
Data sets were growing and accumulating.
Data had to be managed at a detailed transaction level.
Computing systems started to be used for critical business
needs.
Data inconsistency and redundancy.
Enter Database Systems
Attempts to standardize the processes and rules to store and
access data.
Intention to reuse, resell and redeploy solutions across
organizations (with significant customizations).
Attempt to proactively manage Data Integrity and Quality.
7. Database Systems and concepts Evolve
Hierarchical DBMS
Information represented using parent/child relationships
Tree structure is primary data structure.
Network DBMS
Relationships are represented in the form of a network.
Graph is the primary data structure.
Challenges Galore
Hardware Dependency – Software strongly dependent on the
underlying hardware.
Modeling challenges – Representing data under a common
structure.
Integration issues - Integrating across dependent packages was a
nightmare.
Introducing new functionality and updates - Solution providers
struggled with it across customized software deployment.
8. Father of the Relational
Database model
Edgar F Codd
A British Computer Scientist
who made significant
contributions to the theory of
Relational Databases while
working for IBM.
9. Landmark Paper by Codd - “A Relational Model of Data for
Large Shared Data Banks”.
Independence of data from the hardware and storage
implementation.
Automatic navigation to the data set through a high-level
nonprocedural language for data access.
Concept of keys (primary, secondary).
Theoretical proposal, no practical design or implementation.
Codd’s 12 rules for a Relational Database Management System
http://cims.clayton.edu/booth/ITDB%204201/Codd%20PDF.pdf
10.
11. [Diagram: Application 1, Application 2, Application 3, Reporting Solutions, Future Applications; Database Management Systems (DBMS); Databases (Data Storage)]
12. Data Definition
For describing data and data structures for handling the data
Data Manipulation
For describing the operations associated with the data like storage, query, change,
etc.
Data Security and Integrity
For ensuring secure and controlled access to storage and manipulation of data.
For ensuring correctness, consistency and reliability of the data stored.
Data Recovery and Concurrency
For providing and enforcing recovery and concurrency controls.
Data Dictionary
For providing information about the data stored.
For liaising between the conceptual and physical storage.
Performance
For ensuring all the above mentioned operations are performed efficiently and
effectively
13. External/User
How the user accesses and sees the data
[Tables, Views]
Conceptual/Logical
How data is organized logically
[Table Spaces]
Physical/Internal
How data is stored internally
[Data Files]
14. Relation (Tables)– Set of Tuples that have the
same attributes.
Tuples (Rows) – A Tuple usually represents an
object and information about that object.
Attribute (Columns)– Represent a particular
characteristic of that object
Domain - A domain describes the set of permitted values for a given attribute.
It is the set from which the values of an attribute can be defined.
Constraints - Constraints make it possible to further restrict the domain of an
attribute. Constraints help in binding the attribute to a set of rules.
Primary Key - A primary key is a (set of) attribute(s) that uniquely identifies a
tuple within a relation.
Foreign Key - The foreign key can be used to cross-reference tables.
Cardinality - Expresses the number of instances of the entity to which another
entity can be associated via a relation
Index - An index is a mechanism for providing quicker access to data. Indices
can be created on any combination of attributes on a relation.
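The terms above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The customer and purchase tables, their columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Relation (table) with attributes (columns), a primary key, and constraints
conn.execute("""
    CREATE TABLE customer (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        age  INTEGER CHECK (age >= 0)
    )
""")

# The foreign key cross-references the customer relation (one customer, many purchases)
conn.execute("""
    CREATE TABLE purchase (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount      REAL NOT NULL
    )
""")

# Index for quicker lookups of purchases by customer
conn.execute("CREATE INDEX idx_purchase_customer ON purchase(customer_id)")

# Tuples (rows)
conn.execute("INSERT INTO customer (id, name, age) VALUES (1, 'Asha', 34)")
conn.execute("INSERT INTO purchase (customer_id, amount) VALUES (1, 25.0)")
conn.execute("INSERT INTO purchase (customer_id, amount) VALUES (1, 40.0)")


def try_insert(sql):
    """Return True if the statement succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# The CHECK constraint narrows the domain of age; the foreign key rejects orphans
negative_age_ok = try_insert("INSERT INTO customer (id, name, age) VALUES (2, 'Ravi', -5)")
orphan_ok = try_insert("INSERT INTO purchase (customer_id, amount) VALUES (99, 10.0)")

rows = conn.execute("""
    SELECT c.name, COUNT(p.id), SUM(p.amount)
    FROM customer c JOIN purchase p ON p.customer_id = c.id
    GROUP BY c.id
""").fetchall()
```

Both rejected inserts leave the stored data unchanged, which is the sense in which constraints bind an attribute to a set of rules.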
15. Based on the perception that real world can be modeled around
base objects (entities) and relationship among them.
Modeling of data in a top down fashion
Conceptual Model – The model is the highest and least granular
model that defines master reference data entities that are
commonly used in the problem space.
Logical Model – The model generally builds over the conceptual
model by adding additional granular details like operational and
transactional data entities.
Physical Model - Specifies relational database objects such
as database tables, database indexes such as unique key indexes,
and database constraints.
The models can be visualized through what is commonly known
as ER-Diagrams.
16. Process for organizing the attributes and tables of a relational
database to minimize redundancy and dependency.
Objectives (as specified by Codd)
To free the collection of relations from undesirable insertion, update
and deletion dependencies.
To reduce the need for restructuring the collection of relations, as new
types of data are introduced, and thus increase the life span of
application programs.
To make the relational model more informative to users.
To make the collection of relations neutral to the query statistics, where
these statistics are liable to change as time goes by.
Normal Forms (NF)
1NF - it contains atomic values only
2NF - 1NF + every non-key attribute is fully dependent on the whole primary key
3NF - 2NF + every non-key attribute is non-transitively dependent on
the primary key
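A small sketch of what removing a transitive dependency looks like in practice, using plain Python structures rather than a database. The order and customer data, and the helper names normalize and order_view, are hypothetical.

```python
# Flat rows: customer_city depends on customer, not directly on order_id.
# That transitive dependency is what 3NF removes. All data is made up.
flat_orders = [
    {"order_id": 1, "customer": "Asha", "customer_city": "Pune", "amount": 25.0},
    {"order_id": 2, "customer": "Asha", "customer_city": "Pune", "amount": 40.0},
    {"order_id": 3, "customer": "Ravi", "customer_city": "Delhi", "amount": 15.0},
]


def normalize(rows):
    """Split the flat rows into a customers relation and an orders relation."""
    customers = {}
    orders = []
    for row in rows:
        customers[row["customer"]] = {"city": row["customer_city"]}
        orders.append({
            "order_id": row["order_id"],
            "customer": row["customer"],  # acts as a foreign key
            "amount": row["amount"],
        })
    return customers, orders


customers, orders = normalize(flat_orders)

# The city is now stored once, so an update touches exactly one place
customers["Asha"]["city"] = "Mumbai"


def order_view(order_id):
    """Re-join the two relations to answer a question about one order."""
    order = next(o for o in orders if o["order_id"] == order_id)
    return {**order, "customer_city": customers[order["customer"]]["city"]}
```

In the flat form the same update would have to be repeated on every order row for that customer, which is the kind of update dependency the objectives above aim to remove.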
17. Properties that guarantee that database transactions are processed
reliably.
Single logical operation (involving multiple steps) is called transaction.
Properties
Atomicity – “All or Nothing” – If one part of the transaction fails, entire
transaction fails.
Consistency – Any data written to the database must be valid according
to all defined rules, and constraints.
Isolation – Even during concurrent execution, the system ends up in a
state that is the same as the state that would be obtained if the
transactions were executed serially.
Durability - Once a transaction has been committed, the results will
be stored permanently irrespective of errors and crashes that can occur
post commit.
In an RDBMS, ACID properties are implemented using various
techniques like locking and multiversioning
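Atomicity and consistency can be seen in a short sketch with Python's built-in sqlite3 module. The account table, the names, and the balances are hypothetical; the transfer helper is an illustration, not a production pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the "defined rule" that consistency must uphold
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one transaction; return True on commit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM account"))


first_ok = transfer(conn, "alice", "bob", 30)    # both steps succeed
second_ok = transfer(conn, "alice", "bob", 500)  # debit violates the CHECK constraint
final_balances = balances(conn)
```

In the second transfer the credit to bob has already been applied when the debit fails, yet the rollback undoes it, so the failed transaction leaves no partial effect.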
18.
19.
20. RDBMS based solutions are generally the first choice for
database storage/access needs
RDBMS solutions are now mature and predictable.
An army of skilled specialists exists for using,
managing and maintaining RDBMS based systems
RDBMS has spawned an ecosystem of products that
makes choosing RDBMS a no-brainer
21. Ensures Consistent behavior
With the table structure as the base, RDBMS provides a consistent mechanism for
storing and accessing different data sets.
Removes Redundancies
Through Normal forms, redundancies in the data are removed thereby addressing
the errors that can arise from inconsistency of the data stored
Avoid errors
Ensures Data integrity and quality by ensuring consistent storage, enforcing
constraints and relationships and with ability to check data as they are entered
Facilitates Easy analysis
With the SQL based query as the foundation, analyzing different data sets is seamless.
Also given the history of RDBMS, users are provided with a vast repository of tools to
perform analysis.
Ensures Robust Maintenance and Management
Database administrators are provided with tools that enable them to easily
maintain, test, repair and back up the databases housed in the system.
Is Secure
Offers good level of security and access control. Whole or part of the data can be
securely shared across multiple users(applications) based on the privileges granted
to them(it).
22.
23. Rise of Social Networks during the early 2000s
World Wide Web acts as the foundation
Shift in communication patterns
Sharing of personal information and usage of the same
Everyone turned into a publisher
Increased focus around personalization
Recommendations, Ratings, Preferences and providing
Personalized interfaces
Big Data Flood
More data is being generated now than was generated throughout
the prior history of humankind
Need to store and process unstructured or semi structured data at
volumes previously not anticipated and at frequencies not
encountered previously
25. Accessible by users across the globe
Geography is irrelevant
Facebook, Google, Yahoo, Twitter, etc. have users across the world
Highly networked and distributed systems
Systems are accessed and connected over the Internet
Need to be highly scalable
Should be able to handle additional load without redesign
Amazon sees a manifold increase in traffic to the site during the holiday seasons
Expected to be highly available
Systems will be available for access and operations always
Google will incur a huge revenue and credibility loss if the site goes down
Handle large data sets hitting the systems with high frequency
The data need to be stored and processed very quickly
Number of likes and comments on Facebook has exceeded 2.7 billion per day
26.
27. Brewer's CAP Theorem
You can get only two out of the following three
Consistency – Every read sees the most recent write, whichever node serves it
Availability - Need to be available for operations always
Partition Tolerance – Need to work when some nodes are not
accessible.
RDBMS were essentially designed for CA
Latency (response times) is an unfortunate tradeoff for
consistency
Partition tolerance becomes essential in distributed
systems
28. Beyond a point you cannot afford to Scale up storage
It becomes very expensive to keep scaling up.
Is strict consistency really so important?
Ensuring consistency slows the system
Google found that moving from a 10-result page loading in 0.4 seconds to
a 30-result page loading in 0.9 seconds decreased traffic and ad revenues
by 20% (Linden 2006)
Redundancy can be managed
Joins across normalized database tables are less efficient than reading
pre-joined data from a data store
Not All data is relational
Fitting every kind of data under the Rigid Schema structure of RDBMS is
a challenge
Remodeling data read from an RDBMS back into its original model (say tree,
graph, key value) puts significant stress on computing resources.
Attributes (columns) are restricted by domain to store similar data.
Managing semi structured, unstructured data like documents becomes a
challenge.
29. CRUD (Create, Read, Update, Delete) is crude
Updates and deletes should never be allowed as they destroy
information.
Logical and physical separation of concerns ignored
Relational model is a logical model
Database products implemented the relational model at the physical
level as a set of B-tree files with multiple indexes.
Induces artificial overhead onto managing the database.
It is over spinning disks
All RDBMS implementations assume that the data is coming from the
disks
Legacy of an era when memory was expensive.
Memory based systems will be faster
Databases are big and slow
Fundamentally not designed for big data sets
Long queries get slower with more data
30.
31. Core Tenets
Basically Available
The system seems to work all the time
Soft State
It doesn’t have to be consistent all the time
Eventual Consistency
Becomes consistent eventually (at some later time)
Significance
BASE is diametrically opposed to ACID.
ACID is pessimistic and forces consistency at the end of every operation
BASE is optimistic and accepts that the database consistency will be in a
state of flux.
The availability is achieved through supporting partial failures
without total system failure
It is ok for the system to be available for 80% of users and limit failure
to 20% of the users.
Users should understand the implication of Eventual Consistency
Factors in a probability of data loss. Safety of the data is the tradeoff
Need to understand how eventual is Eventual
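The idea of soft state and eventual consistency can be sketched with a toy in-process model. The Replica and EventuallyConsistentStore classes and the explicit sync() step are assumptions made for illustration; real systems propagate updates in the background over a network.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Writes land on one replica and reach the others only when sync() runs."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # (key, value) updates not yet propagated

    def write(self, key, value):
        # Acknowledged as soon as the first replica has it (basically available)
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # May return stale data until propagation has happened (soft state)
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        """Propagate pending updates in order; afterwards all replicas agree."""
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("likes", 10)
stale = store.read("likes", replica_index=2)  # replica 2 has not seen the write yet
store.sync()
fresh = store.read("likes", replica_index=2)  # now consistent with replica 0
```

The window between write() and sync() is the "how eventual is Eventual" question: a reader hitting another replica in that window sees old data, and if the first replica were lost before sync() the update would be gone.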
32. NoSQL – Not Only SQL
It is not SQL and it is not Relational
Essential Feature set
Elastic Scaling – Rely on Scale out rather than Scale up
Big Data – Handle High Volume, High Velocity, High Variability
Commoditize Manageability – Reduce dependence on highly skilled
DBA and lower administration costs
Economics – Build over commodity hardware
Flexible data model – Remove data model based restrictions.
Applicability
Performance and real time nature over consistency
High scalability
Store and retrieve large data sets
Does not require a relational model
33. Key Value
Idea is to use a hash table where there is a unique key and a pointer to a
particular item of data. Simplest to implement.
It is inefficient when you are only interested in querying or updating part
of a value
Column Store
Created to store and process very large amounts of data distributed over
many machines
Still keys but they point to multiple columns.
The columns are arranged by column family.
Document
The model is basically versioned documents that are collections of other
key-value collections.
The semi-structured documents are stored in formats like JSON,
allowing nested values associated with each key.
Document databases support more efficient querying than plain key-value stores.
Graph
A flexible graph model is used which, again, can scale across multiple
machines
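The four models can be contrasted with plain Python structures. This is only a sketch of the shapes of the data, not of any particular product; the keys, names, and helper functions are hypothetical.

```python
import json

# Key value: a unique key points to an opaque value
key_value = {"user:42": json.dumps({"name": "Asha", "city": "Pune"})}


def update_city_kv(store, key, city):
    """Changing part of a value means reading and rewriting the whole value."""
    value = json.loads(store[key])
    value["city"] = city
    store[key] = json.dumps(value)


# Column store: a row key points to columns grouped by column family
column_store = {
    "user:42": {
        "profile": {"name": "Asha", "city": "Pune"},
        "activity": {"last_login": "2012-11-01"},
    }
}

# Document: nested key-value collections that can be queried by field
documents = [
    {"_id": 42, "name": "Asha", "address": {"city": "Pune"}, "tags": ["admin"]},
    {"_id": 43, "name": "Ravi", "address": {"city": "Delhi"}, "tags": []},
]


def find_by_city(docs, city):
    """Query inside the documents rather than fetching by key only."""
    return [d["name"] for d in docs if d["address"]["city"] == city]


# Graph: nodes and the relationships between them (adjacency sets)
graph = {"Asha": {"Ravi"}, "Ravi": {"Meera"}, "Meera": set()}


def friends_of_friends(g, person):
    """Follow relationships two hops out, a natural query for a graph model."""
    return {fof for friend in g[person] for fof in g[friend]} - {person}


update_city_kv(key_value, "user:42", "Mumbai")
profile_city = column_store["user:42"]["profile"]["city"]  # read one column only
pune_names = find_by_city(documents, "Pune")
second_degree = friends_of_friends(graph, "Asha")
```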
34. Access Interfaces
Language Specific API
REST/HTTP
Thrift
Map Reduce
Logical Data Model
Key Value
Column Family Store
Document
Graph
Support and Distribution
CAP Support
Multi Data Center Support
Dynamic Provisioning
Proactive Monitoring
Data Persistence
Memory Based
Disk Based
Combination of Memory and Disk
35. NoSQL
Key Value: MemCached, Redis, SimpleDB, Tokyo Cabinet, Dynamo, Voldemort
Column Store: SimpleDB, BigTable, HBase, Cassandra, HyperTable, Azure TS
Document: CouchDB, MongoDB, Lotus Domino, Riak
Graph: Neo4J, InfoGrid, FlockDB, InfiniteGraph
36.
37. It is not Mature
RDBMS is mature, stable and functionally rich.
Most NoSQL alternatives are in pre-production versions with many key
features yet to be implemented.
Support
Most NoSQL systems are open source projects.
Support mostly offered by startup companies, with reach and
credibility not on par with RDBMS Vendors.
Analytics
NoSQL databases offer few facilities for ad-hoc query and analysis.
Even a simple query requires significant programming expertise.
At present, commonly used BI tools do not provide credible
connectivity to NoSQL.
Administration and Maintenance
The desired goal of zero maintenance is far away.
In reality significant effort is required to maintain the systems.
Expertise
Currently very limited awareness and knowledge
38. Scalability
Master Slave - One master many slaves
Write to master; Read from any of the slaves
Partitioning – Group and localize related functions across nodes
Partition Vertically (by functions) or Horizontally (by keys)
Caching - Memory based cache in front of the Database
Address scaling issues due to read and write loads
High Availability
Clustering - Group of systems responsible for a service
Build redundancy into a cluster to eliminate single points of failure
Mirroring and Replication – Maintain a hot standby
Handle planned or unplanned downtimes
Recovery Solutions - dependable data backup, restore, and
recovery procedures
Combine process with tools
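Horizontal partitioning by key, mentioned above, can be sketched as follows. The ShardedStore class is hypothetical and keeps each partition as an in-memory dict; in a real deployment each partition would be a separate database node. A SHA-256 digest is used so that a key maps to the same partition on every run.

```python
import hashlib


class ShardedStore:
    """Spread keys across a fixed number of partitions (shards) by hashing the key."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def shard_index(self, key):
        # Stable hash of the key, reduced to a shard number
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_index(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_index(key)].get(key)


store = ShardedStore(shard_count=4)
for user_id in range(100):
    store.put(f"user:{user_id}", {"id": user_id})

# How many keys ended up on each shard
shard_sizes = [len(shard) for shard in store.shards]
```

A plain modulo scheme like this one remaps most keys when the shard count changes, which is one reason partitioning needs planning up front.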
39. Performance
Be open to Denormalization – And accelerate reads
Allow redundancy and duplicates to reduce joins
Optimize your costly queries - Analyze and optimize the expensive
queries
Use a mix of design strategy, indices, and analysis from query optimization tools
Invest in better hardware – storage and memory
It is not a bad bet - The storage and memory costs have dropped significantly
Rigid Schemas – Not all data is relational
Even the most schema-less model has some schema
The world revolves around structures
If a key-value kind of store is needed, you can build the same in any
RDBMS
RDBMS will provide an added advantage of structured access and queries
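The key-value point above can be sketched with Python's built-in sqlite3 module standing in for the RDBMS. The kv table and the put and get helpers are hypothetical names used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A two-column table is enough to act as a key-value store
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)")


def put(key, value):
    """Insert the pair, or overwrite the value if the key already exists."""
    conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))


def get(key):
    """Return the value for key, or None if the key is absent."""
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None


put("session:1", "active")
put("session:2", "expired")
put("session:1", "idle")  # overwrites the earlier value

# The added advantage: structured queries over the same data
session_keys = conn.execute(
    "SELECT k FROM kv WHERE k LIKE 'session:%' ORDER BY k"
).fetchall()
```

INSERT OR REPLACE is SQLite syntax; other RDBMS products express the same upsert differently.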
40. Systems eventually will gravitate towards one of these three
Fast, agile, highly scalable data stores
Handlers of complex transactional semantics
Analytical processors and facilitators
World is never binary
It is never either this or that.
Why fight over technicalities
Drive decisions based on use cases
Choose a model based on the use cases and scenarios
Research and understand what your application needs
Stay away from substituting “Hard work” with “Rhetoric”
Be open to experimentation
41.
42. http://www.guug.de/lokal/muenchen/2007-05-14/rdbmsc.pdf
http://ansonalex.com/infographics/twitter-usage-statistics-2012-infographic/
http://www.mountainman.com.au/software/history/it1.html
http://www.slideshare.net/renguzi/codd
http://cims.clayton.edu/booth/ITDB%204201/Codd%20PDF.pdf
http://www.scribd.com/doc/19381895/RDBMS-Concepts
http://www.gitta.info/DBSysConcept/en/text/DBSysConcept.pdf
http://en.wikipedia.org/wiki/Relational_database
http://en.wikipedia.org/wiki/ACID
http://blogs.hbr.org/now-new-next/2009/05/the-social-data-revolution.html
http://www.go-gulf.com/blog/60-seconds
http://en.wikipedia.org/wiki/CAP_theorem
http://highscalability.com/drop-acid-and-think-about-data
http://queue.acm.org/detail.cfm?id=1394128
http://www.bailis.org/blog/safety-and-liveness-eventual-consistency-is-not-safe/
http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772
http://rebelic.nl/engineering/the-four-categories-of-nosql-databases/
http://www.slideshare.net/ksankar/nosql-4559402
http://www.thevirtualcircle.com/2008/11/10/6-reasons-why-relational-database-will-be-superseded/
http://www.slideshare.net/sbtourist/scale-your-database-and-be-happy
Note:
Many images used in the deck were found using Google image search. Even though I have not been able to
mention the sources of all the images individually, I extend my sincere thanks to the owners of the images for making
them available on the net