Big Data, NoSQL with MongoDB and Cassandra
1. Big Data and NoSQL with MongoDB &
Cassandra
NOSQL Intro with MongoDB and Cassandra
1
2.
Brian Enochson
- Software engineer who has worked as a designer / developer on NoSQL (Mongo, Cassandra, Hadoop)
- Specializes in software development, architecture and training
Brian Enochson
brian.enochson@gmail.com
Twitter @benochso
Google Plus
https://plus.google.com/+BrianEnochson
3.
• Presentation Intro
• Introduction to Big Data
• Introduction to NoSQL
• Relational database vs. NoSQL technology: compare and contrast
• NoSQL landscape
4.
• Introduction to MongoDB
• MongoDB components, capabilities and common use cases
• JSON & BSON
• Documents, collections, references and Mongo ID
• Querying
• Data Modeling/Schema Design
• Replication & Sharding
7.
• Why are databases like Mongo or Cassandra needed?
To understand, one needs to look at
  • The history of databases
  • How systems were built in the past
• Then examine modern applications
  • Web scale
  • Data acquisition
• Other factors like the cost of H/W
8.
• 1960s – Hierarchical and network type (IMS and CODASYL)
• 1970s – Beginnings of the theory behind the relational model (Codd)
• 1980s – Rise of the relational model. SQL. E/R model (Chen)
• 1990s – Access/Excel and MySQL. ODMS began to appear
• 2000s – Two forces: large enterprise and open source. Google and Amazon. CAP Theorem (more on that to come…)
• 2010s – Emergence of NoSQL as an industry player and viable alternative
9.
• Developers today are faced with Internet scale
• 100,000s of users
• Low cost of storage
• Increased processing power
• Ability (and need) to capture millions of events. Caching solves it to an extent but brings other complexities
  • Real-time
  • Need to scale out and not up (add any number of low-cost machines vs. replacing with a more powerful machine)
• Cost
  • Let’s not forget that enterprise DBs can become expensive at Internet scale
  • Open source DBs may solve license cost, but don’t ignore operational costs
11.
• Relational
  • Divide into tables, relate through foreign keys, DB constraints, normalized data; the interface is SQL
• NoSQL
  • Store in schemaless format, redundancy encouraged, application access (your queries) determines the storage format. Interface varies and is optimized for the implementation; no forced DB constraints.
12. Luckily, due to the large number of compromises made when attempting to scale their existing relational databases, these tradeoffs were not so foreign or distasteful as they might have been.
Greg Burd https://www.usenix.org/legacy/publications/login/2011-10/openpdfs/Burd.pdf
13.
Eventual consistency
Application has increased responsibility, such as maintaining consistency and handling transactions
Store redundant data
14. The driving force in requiring new technology is often referred to as the “3 V’s”.
• Volume – amount of data
• Variety – range of data types and sources
• Velocity – speed of data in and out
15. NoSQL != Big Data
NoSQL products were created to help solve the big
data problem.
Big data is a much larger problem than just
storage. Analysis tools like Hadoop, messaging
systems like Kafka, real time processing engines
like Storm and machine learning (Mahout) all help
solve the big data problem.
16. Document DB
• MongoDB, CouchDB
Wide Column – Column Family
• Cassandra, HBase, Amazon SimpleDB
Key Value
• Riak, Redis, DynamoDB, Voldemort, MemcacheDB
Graph
• Neo4J, OrientDB
Search (search can also be a persistence store)
• Lucene, Solr, ElasticSearch
Many, many, many more! (http://nosql-database.org/)
17.
Choosing the right NoSQL type and eventual product depends on…
Type of Data
• One key and a lot of data?
• Schema variance
• High volume of data?
• Storing media, blobs
• Document oriented?
• Tracking relationships?
• Combination?
• Multi-Datacenter
Type of Access
Volumes of Data (there is big data and there is BIG DATA)
Need/want support/services/training
19. PROBABLY HAVE HEARD OF ACID
• Atomicity – All or none
• Consistency – What is written is valid
• Isolation – Concurrent operations behave as if run one at a time
• Durability – Once committed to the DB, it stays
This is the world we have lived in for a long time…
20.
Many may have heard this one
CAP stands for Consistency, Availability and Partition Tolerance
• Consistency – every node sees the same data at the same time (a read returns the latest write). This is narrower than the C in ACID.
• Availability – every request to a working node gets a response.
• Partition Tolerance – no failure short of complete network failure stops the system from responding.
** http://www.cs.berkeley.edu/~brewer/cs262b2004/PODC-keynote.pdf
21. In Mongo terms you can have 2 of 3. Availability, Partition-Tolerance
or Eventual Consistency.
23.
• So we are talking about large amounts of data
• High velocity of acquisition
• A lot of variety that we need to store. We will worry later about how to handle it (or not)
• Need to scale and not break the bank
• Want the database to support agile development, not hinder it
24.
• Maybe consider going relational if
  • Highly transactional (FoundationDB?)
  • Business Intelligence systems (Hadoop may make this not true)
  • Don’t be fooled by fear of losing ACID…
http://highscalability.com/blog/2013/5/1/myth-eric-brewer-on-why-banks-are-base-not-acid-availability.html
27. A few high-level points
• Document oriented
• Storage format is JSON (actually BSON)
• Replication built in
• Master / slave architecture
• Strong querying support
• Name from "humongous"
29.
• No cross-document transactions
• No joins
• Replication – master / slave
• Sharding
30.
* Credit – Dwight Merriman, Founder and CEO – MongoDB (was 10Gen)
31.
Master Slave and Secondary Reads
** http://docs.mongodb.org/manual/core/replication-introduction/
32.
Primary
Receives all write requests
A replica set can have only one primary
Mongo stores all changes in the oplog
Secondary
Replicates the primary's oplog
Clients can prefer to read from secondaries
If the primary goes down, a new primary is elected (after 10 seconds with no response)
34.
Shards
Store the data; in production each shard is normally a replica set
Routers
Route client operations to shards based on the shard key; there can be more than one for availability
Shard key is range based or hashed
Config Servers
Contain cluster metadata
In production there are 3 config servers
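The difference between a range-based and a hashed shard key can be sketched in a few lines of Python. This is only an illustration of the routing idea, not MongoDB's implementation: the shard names, the split points and the use of MD5 with a modulo are all assumptions made for the example.

```python
import bisect
import hashlib

# Hypothetical cluster: three shards (names are illustrative only).
SHARDS = ["shard0", "shard1", "shard2"]

# Range based: assumed split points on the shard key. Keys below "h" go to
# shard0, keys from "h" up to (not including) "q" to shard1, the rest to shard2.
SPLIT_POINTS = ["h", "q"]


def route_by_range(shard_key: str) -> str:
    """Pick the shard whose key range contains shard_key."""
    return SHARDS[bisect.bisect_right(SPLIT_POINTS, shard_key)]


def route_by_hash(shard_key: str) -> str:
    """Pick a shard from a hash of the key (MD5 + modulo is an assumption)."""
    digest = hashlib.md5(shard_key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


range_choice = route_by_range("henry")
hash_choice = route_by_hash("henry")
```

Range routing keeps neighbouring keys on the same shard, which helps range queries but can create hot spots; hashing spreads keys more evenly at the cost of scattering ranges.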
35.
• At its simplest, Mongo is a document-oriented database
• MongoDB stores all data in documents, which are JSON-style data structures composed of field-and-value pairs.
MongoDB stores documents on disk in the BSON serialization format. BSON is a binary representation of JSON documents. BSON contains more data types than does JSON.
** For in-depth BSON information, see bsonspec.org.
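One practical consequence of "more data types" is dates: BSON has a date type while plain JSON does not. The sketch below uses only Python's standard json module (not a BSON library) to show the gap; the document fields are made up for the example.

```python
import json
from datetime import datetime, timezone

# Hypothetical document with a date field.
doc = {"title": "NoSQL intro", "created": datetime(2014, 1, 1, tzinfo=timezone.utc)}

try:
    json.dumps(doc)
    json_ok = True
except TypeError:
    json_ok = False  # plain JSON has no native date type

# A common workaround in plain JSON is to store the date as a string.
as_json = json.dumps({**doc, "created": doc["created"].isoformat()})
```

With a string, the reader has to know the field is a date and parse it; a typed format carries that information with the value.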
38.
Documents have the following rules:
The maximum BSON document size is 16
megabytes.
The field name _id is reserved for use as a
primary key; its value must be unique in the
collection.
The field names cannot start with the $
character.
The field names cannot contain the . character.
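The field-name rules above are easy to check in application code before sending a document to the database. The helper below is a hypothetical sketch, not part of any Mongo driver; it checks only the naming rules and leaves out the 16 MB size check, which would need the BSON-encoded size.

```python
def field_name_problems(document: dict) -> list:
    """Return a list of rule violations for top-level and nested field names."""
    problems = []
    for name, value in document.items():
        if name.startswith("$"):
            problems.append(f"field {name!r} starts with '$'")
        if "." in name:
            problems.append(f"field {name!r} contains '.'")
        if isinstance(value, dict):  # check embedded subdocuments too
            problems.extend(field_name_problems(value))
    return problems


ok = field_name_problems({"_id": 1, "title": "x"})
bad = field_name_problems({"$bad": 1, "a.b": {"$in": 2}})
```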
43.
3_imp_exp.txt
Mongo provides tools for getting data in and out of the database
• Data can be exported to JSON files
• JSON files can then be imported
45.
Aggregation Framework
Uses a pipeline model to perform a series of operations on data. A common pattern is a match phase (selection) followed by a grouping phase (which creates the result)
Map Reduce
Two phases
Map phase that creates one or more documents from each input document
Reduce phase that combines output from Map into some result
Finalize – optional phase that can perform some logic (e.g. sorting) on the reduce output
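Both models can be imitated in plain Python to show the flow of data. This is a sketch of the ideas only, not of Mongo's API: the sample documents, field names and function names are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical input documents; field names are illustrative only.
posts = [
    {"author": "ann", "tags": ["nosql", "mongo"]},
    {"author": "bob", "tags": ["cassandra"]},
    {"author": "ann", "tags": ["nosql"]},
]


# Pipeline style: a match stage (selection) feeding a group stage (result).
def match_stage(docs, field, value):
    return (doc for doc in docs if doc[field] == value)


def group_count_stage(docs, field):
    return dict(Counter(doc[field] for doc in docs))


posts_by_ann = group_count_stage(match_stage(posts, "author", "ann"), "author")


# Map-reduce style: map emits (key, value) pairs, reduce combines them per key.
def map_phase(doc):
    for tag in doc["tags"]:
        yield tag, 1


def reduce_phase(key, values):
    return sum(values)


def map_reduce(docs):
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_phase(doc):
            grouped[key].append(value)
    reduced = {key: reduce_phase(key, values) for key, values in grouped.items()}
    # Finalize (optional): sort by count, highest first, ties alphabetically.
    return sorted(reduced.items(), key=lambda item: (-item[1], item[0]))


tag_counts = map_reduce(posts)
```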
46.
5_admin.txt
• show dbs
• show collections
• db.stats()
• db.posts.stats()
• db.posts.drop()
• db.system.indexes.find()
47.
• Remember, with NoSQL redundancy is not evil
• Applications ensure consistency, not the DB
• Applications join data; joins are not defined in the DB
• The data model is schema-less
• The data model is usually built to support queries
48.
• What are your basic units of data (what would be a document)?
• How are these units grouped / related?
• How does Mongo let you query this data, and what are the options?
• Finally, maybe most importantly, what are your application's access patterns?
  • Reads vs. writes
  • Queries
  • Updates
  • Deletions
  • How structured is it
49.
Normalized
• Similar to relational model.
• One collection per entity type
• Little or no redundancy
• Allows clean updates, familiar to many SQL users,
easier to understand
51.
• From parent to child
{
  name: "O'Reilly Media",
  books: [123456789, 234567890, ...]
}
• From child to parent
{
  _id: 123456789,
  title: "MongoDB: The Definitive Guide",
  publisher_id: "oreilly"
}
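Because the database does not join these documents, the application resolves the reference itself. A minimal sketch of the child-to-parent case, using in-memory dictionaries as stand-ins for two collections; the publisher's `_id` of "oreilly" is an assumption inferred from `publisher_id`, and the helper name is hypothetical.

```python
# Hypothetical in-memory stand-ins for two collections.
publishers = {
    "oreilly": {"_id": "oreilly", "name": "O'Reilly Media"},
}
books = [
    {
        "_id": 123456789,
        "title": "MongoDB: The Definitive Guide",
        "publisher_id": "oreilly",
    },
]


def book_with_publisher(book):
    """Resolve the child-to-parent reference in application code."""
    publisher = publishers.get(book["publisher_id"])
    return {**book, "publisher": publisher}


joined = book_with_publisher(books[0])
```

Against a real database this costs a second query per lookup, which is the trade-off to weigh against embedding.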
52.
A pattern often used in Mongo is to embed information as subdocuments.
• Used when there is a "contains" relationship
• Easier querying (when related data is often used together)
• Need to keep the 16 MB document size limit in mind
54.
• Many or few collections
• Many collections
  • As seen in normalized
  • Clean and little redundancy
  • May not provide best performance
  • May require frequent updates to the application if new types are added
• Multiple collections
  • Middle ground, partially normalized
• Not many collections
  • One large generic collection
  • Contains many types
  • Use a type field
55.
• Document growth – a document will be relocated if it exceeds its allocated size
• Atomicity
  • Atomic at document level
  • Consideration for insertions, removes and multi-document updates
• Sharding – collections are distributed across mongod instances using a shard key
• Indexes – index fields that are often queried; indexes affect write performance slightly
• Consider using TTL to automatically expire documents
57. Mongo Driver
Supplied by MongoDB itself
Easy to set up
Housed on Maven repo
Morphia
Uses app model
Handles references well
Spring Mongo
Great if using Spring already
59.
Get MEAN
Mongo, Express, Angular and Node
http://bitnami.com/stack/mean
http://mean.io
Can install, in a VM or even in the cloud
60.
Database in the cloud
https://mongolab.com/
Can access using shell, GUI Mongo explorer,
mongoimport, mongoexport and use in
application
Amazon, Rackspace, Joyent or Azure
61. MongoDB: The Definitive Guide, 2nd Edition
By: Kristina Chodorow
Publisher: O'Reilly Media, Inc.
Pub. Date: May 23, 2013
Print ISBN-13: 978-1-4493-4468-9
Pages in Print Edition: 432
MongoDB in Action
By: Kyle Banker
Publisher: Manning Publications
Pub. Date: December 16, 2011
Print ISBN-10: 1-935182-87-0
Print ISBN-13: 978-1-935182-87-0
Pages in Print Edition: 312
The Definitive Guide to MongoDB: The NoSQL Database for Cloud and Desktop Computing
By Eelco Plugge; Peter Membrey; Tim Hawkins
Apress, September 2010
ISBN: 9781430230519
327 pages
62. MongoDB Applied Design Patterns
By: Rick Copeland
Publisher: O'Reilly Media, Inc.
Pub. Date: March 18, 2013
Print ISBN-13: 978-1-4493-4004-9
Pages in Print Edition: 176
MongoDB for Web Development (rough cut!)
By: Mitch Pirtle
Publisher: Addison-Wesley Professional
Last Updated: 14-JUN-2013
Pub. Date: March 11, 2015 (Estimated)
Print ISBN-10: 0-321-70533-5
Print ISBN-13: 978-0-321-70533-4
Pages in Print Edition: 360
Instant MongoDB
By: Amol Nayak
Publisher: Packt Publishing
Pub. Date: July 26, 2013
Print ISBN-13: 978-1-78216-970-3
Pages in Print Edition: 72
64. Let’s look briefly at Cassandra as an
alternative to Mongo
65.
• Developed at Facebook, based on Google Bigtable and Amazon Dynamo **
• Open sourced in mid 2008
• Apache project March 2009
• Commercial support through DataStax (originally known as Riptano, founded 2010)
• Used at Netflix, eBay and many more. Reportedly the largest installation is 300 TB on 400 machines
• Current version is 2.0.3
66.
• No Single Point of Failure – highly available.
  • Peer to peer – no master
• Data center aware – distributed architecture
• Linear scaling – just add hardware
• Eventual consistency, tunable tradeoff between latency and consistency
• Architecture is optimized for writes.
• Can have 2 billion columns (cells)!
• Data modeling for reads. Design starts with looking at your queries. (Sound familiar?)
• With CQL it became more SQL-like, but no joins, no subqueries, limited ordering (but very useful)
• Column names can be part of the data, e.g. time series
67.
** Important Term **
Quorum: Q = floor(N / 2) + 1.
We get consistency in a BASE world by satisfying W + R > N
3 obvious ways:
1. W = 1, R = N
2. W = N, R = 1
3. W = Q, R = Q
(N is the replication factor, R = read replica count, W = write replica count)
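The arithmetic is small enough to check directly. A minimal sketch, with function names chosen for this example:

```python
def quorum(n: int) -> int:
    """Quorum size for replication factor n: floor(n / 2) + 1."""
    return n // 2 + 1


def is_consistent(w: int, r: int, n: int) -> bool:
    """True when every read set must overlap every write set (W + R > N)."""
    return w + r > n


n = 3
q = quorum(n)
three_ways = [is_consistent(1, n, n), is_consistent(n, 1, n), is_consistent(q, q, n)]
```

With N = 3 the quorum is 2, so quorum writes plus quorum reads touch 4 replicas in total and must share at least one. W = 1 with R = 1 touches only 2 and can miss the latest write.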
68.
C* data model is made of these:
Column – a name, a value and a timestamp. Applications can use the name as the data and not use the value. (Like a column in an RDBMS.)
Row – a collection of columns identified by a unique key. The key is called a partition key. (Like a row in an RDBMS.)
Column Family – container for an ordered collection of rows. Each row is an ordered collection of columns. Each column has a key and maybe a value. (Like a table in an RDBMS.) This is also known as a table now in C* terms.
Keyspace – administrative container for CFs. It is a namespace. Also has a replication strategy – more on that later. (Like a DB or schema in an RDBMS.)
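These terms nest inside one another, which a small in-memory model can make visible. The classes below are a teaching sketch, not Cassandra's storage engine; the class names, the sensor example and the rule that the newest timestamp wins on conflicting writes are stated here as assumptions of the sketch.

```python
from typing import Dict, List, NamedTuple, Optional


class Column(NamedTuple):
    name: str
    value: Optional[bytes]
    timestamp: int  # write time; the newest write wins in this sketch


class Row:
    def __init__(self, partition_key: str) -> None:
        self.partition_key = partition_key
        self.columns: Dict[str, Column] = {}

    def write(self, column: Column) -> None:
        """Keep the column with the newest timestamp."""
        current = self.columns.get(column.name)
        if current is None or column.timestamp >= current.timestamp:
            self.columns[column.name] = column

    def ordered_columns(self) -> List[Column]:
        """Columns within a row are kept ordered by name."""
        return [self.columns[name] for name in sorted(self.columns)]


# A column family (table) maps partition keys to rows; a keyspace maps
# column family names to column families.
column_family: Dict[str, Row] = {}
keyspace: Dict[str, Dict[str, Row]] = {"readings": column_family}

# Time series: the column name (a timestamp string) is itself the data.
row = column_family.setdefault("sensor-1", Row("sensor-1"))
row.write(Column("2014-01-02T00:00", b"21.5", 2))
row.write(Column("2014-01-01T00:00", b"20.0", 1))
row.write(Column("2014-01-01T00:00", b"19.0", 0))  # older timestamp, ignored
```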
70.
Tokens – partitioner-dependent element on the ring.
Each node has a single unique token assigned.
Each node claims the range of tokens from the token of the previous node on the ring (exclusive) up to its own token (inclusive).
Use this formula:
Initial_Token = Zero_Indexed_Node_Number * ((2^127) / Number_Of_Nodes)
In cassandra.yaml:
initial_token: 42535295865117307932921825928971026432
** http://blog.milford.io/cassandra-token-calculator/
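The formula can be run as-is in Python, whose integers handle 2^127 without overflow. The function name is made up for this sketch, and the formula assumes a token range of 0 to 2^127; other partitioners use other ranges.

```python
def initial_tokens(number_of_nodes: int) -> list:
    """Evenly spaced initial tokens over the 0 .. 2**127 token range."""
    step = 2**127 // number_of_nodes
    return [node_number * step for node_number in range(number_of_nodes)]


tokens = initial_tokens(4)
second_node_token = tokens[1]
```

For a 4-node cluster the second node (zero-indexed node 1) gets 42535295865117307932921825928971026432, the value shown for cassandra.yaml above.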
71.
• Replication determines how many copies of each piece of data are stored. In C* terms it is the Replication Factor or “RF”.
• In C*, RF is set at the keyspace level:
CREATE KEYSPACE drg_compare WITH replication = {'class':'SimpleStrategy', 'replication_factor':3};
• How the data is replicated is called the Replication Strategy
  • SimpleStrategy – returns nodes “next” to each other on the ring. Assumes a single DC
  • NetworkTopologyStrategy – for configuring per data center. Rack and DC aware.
update keyspace UserProfile with strategy_options=[{DC1:3, DC2:3}];
73.
Using token generation values from before. 4 node cluster.
Write value with token 32535295865117307932921825928971026432
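As a worked version of this slide, the sketch below builds four evenly spaced tokens, finds the node whose range contains the written token and, assuming SimpleStrategy-style placement with a replication factor of 3, lists the replicas. Node names and function names are made up, and a real cluster may use a different partitioner and token range.

```python
import bisect

NODE_COUNT = 4
# Evenly spaced tokens over 0 .. 2**127; node names are hypothetical.
RING = [(i * (2**127 // NODE_COUNT), f"node{i}") for i in range(NODE_COUNT)]
RING_TOKENS = [token for token, _ in RING]


def owner_index(token: int) -> int:
    """Index of the first node whose token is >= token, wrapping at the end."""
    return bisect.bisect_left(RING_TOKENS, token) % len(RING)


def replicas(token: int, replication_factor: int) -> list:
    """Owner plus the next nodes clockwise (SimpleStrategy-style placement)."""
    start = owner_index(token)
    count = min(replication_factor, len(RING))
    return [RING[(start + step) % len(RING)][1] for step in range(count)]


written_token = 32535295865117307932921825928971026432
owner = RING[owner_index(written_token)][1]
placement = replicas(written_token, 3)
```

The written token falls between node0's token (0) and node1's token, so node1 owns it and the next two nodes on the ring hold the other copies.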
75.
• When writing, a Coordinator Node is selected. It is selected at write (or read) time. Not a SPOF!
• Using the Gossip Protocol, nodes share information with each other: who is up, who is down, who is taking which token ranges, etc. Every second, each node shares with 1 to 3 nodes.
• Consistency Level (CL) – says how many nodes must agree before an operation is a success. Set per read or write operation.
  • ONE – coordinator will wait for one node to ack the write (also TWO, THREE). ONE is the default if none is provided.
  • QUORUM – we saw that before: N / 2 + 1. Also LOCAL_QUORUM, EACH_QUORUM
  • ANY – waits for some replica. If all are down, still succeeds. Only for writes. Doesn’t guarantee the data can be read.
  • ALL – blocks waiting for all replicas
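The consistency levels map onto a number of acknowledgements, which ties back to the W + R > N rule. A rough sketch with made-up function names; ANY, LOCAL_QUORUM and EACH_QUORUM are left out because they depend on hints or on per-data-center replication factors.

```python
def required_acks(consistency_level: str, replication_factor: int) -> int:
    """Number of replica acknowledgements the coordinator waits for."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def reads_see_latest_write(write_cl: str, read_cl: str, replication_factor: int) -> bool:
    """True when the read and write replica sets must overlap (W + R > N)."""
    w = required_acks(write_cl, replication_factor)
    r = required_acks(read_cl, replication_factor)
    return w + r > replication_factor


quorum_both = reads_see_latest_write("QUORUM", "QUORUM", 3)
one_both = reads_see_latest_write("ONE", "ONE", 3)
```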
76.
3 important concepts:
Read Repair - At time of read, inconsistencies are noticed between nodes and replicas are updated. Direct and background. Direct is determined by CL.
Anti-Entropy Node Repair - For data that is not read frequently, or to update data on a node that has been down for a while, use the nodetool repair process (also called anti-entropy repair). It builds Merkle trees, compares nodes and does the repair.
Hinted Handoff - Writes are always sent to all replicas for the specified row regardless of the consistency level specified by the client. If a node happens to be down at the time of write, its corresponding replicas will save hints about the missed writes, and then hand off the affected rows once the node comes back online. This notification happens via Gossip. Default 1 hour.
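Hinted handoff in particular is easy to picture as a toy simulation. The sketch below is heavily simplified and is not how Cassandra stores hints: it keeps hints in one shared dictionary instead of on peer nodes, ignores the hint window, and all names are invented for the example.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


def write(replicas, hints, key, value):
    """Send the write to every replica; record a hint for each one that is down."""
    for node in replicas:
        if node.up:
            node.data[key] = value
        else:
            hints.setdefault(node.name, []).append((key, value))


def replay_hints(node, hints):
    """Hand off the missed writes once the node is back online."""
    for key, value in hints.pop(node.name, []):
        node.data[key] = value


a, b, c = Node("a"), Node("b"), Node("c")
hints = {}

c.up = False                        # node c is down at write time
write([a, b, c], hints, "k", "v")
missed_while_down = "k" not in c.data

c.up = True                         # node c comes back online
replay_hints(c, hints)
```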
77.
• Interaction with Cassandra can be done using one of the supplied clients such as the CLI or CQL. Otherwise, client applications are built using a language client library.
• Many clients in multiple languages, including Java, .NET, Python, Scala, Go, PHP, Node.js, Perl, Ruby, etc.
  • Java:
    • Hector wraps the underlying Thrift API. Hector is one of the most commonly used client libraries.
    • Astyanax is a client library developed by Netflix.
    • DataStax CQL – newest CQL driver, will be very familiar to JDBC developers
    • And many more … (JPA)
• There are also DataStax OpsCenter and various other GUIs and a REST API (Virgil)
78.
Many More Topics / Information Related to C*
not covered
Great for Fast Writes
No Single POF
Data Center Aware
Also Relative Ease Of Use