NoSQL: An Analysis

April 10-12 | Chicago, IL
Andrew J. Brust, Founder and CEO, Blue Badge Insights

Meet Andrew
CEO and Founder, Blue Badge Insights
Big Data blogger for ZDNet
Microsoft Regional Director, MVP
Co-chair VSLive! and 17 years as a speaker
Founder, Microsoft BI User Group of NYC
• http://www.msbinyc.com
Co-moderator, NYC .NET Developers Group
• http://www.nycdotnetdev.com
“Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
brustblog.com, Twitter: @andrewbrust

Andrew’s New Blog (bit.ly/bigondata)

Agenda
Why NoSQL?
Concepts
NoSQL Categories
Provisioning, market, applicability
Take-aways

NoSQL Data Fodder
Addresses, preferences, notes, friends, followers, documents

“Web Scale”
This is the term used to justify NoSQL
The scenario: simple needs, but “made up for in volume”
• Millions of concurrent users
Think of sites like Amazon or Google
Think of non-transactional tasks like loading catalog data to display a product page, or environment preferences

NoSQL Common Traits
Non-relational
Non-schematized/schema-free
Open source
Distributed
Eventual consistency
“Web scale”
Developed at big Internet companies

Consistency
CAP Theorem
• Databases may only excel at two of the following three attributes: consistency, availability and partition tolerance
NoSQL databases typically do not offer “ACID” guarantees
• Atomicity, consistency, isolation and durability
Instead offers “eventual consistency”
Similar to DNS propagation

Consistency
Things like inventory and account balances should be consistent
• Imagine updating a server in Seattle to show that stock was depleted
• Imagine not updating the server in NY
• A customer in NY goes to order 50 pieces of the item
• The order is processed even though there is no stock
Things like catalog information don’t have to be, at least not immediately
• If a new item is entered into the catalog, it’s OK for some customers to see it even before the other customers’ server knows about it
But catalog info must come up quickly
• Therefore don’t lock data in one location while waiting to update the other
Therefore, OK to sacrifice consistency for speed, in some cases
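The Seattle/New York stock scenario can be sketched as two replicas with asynchronous replication. This is a minimal illustration in plain Python, not any product's replication protocol; the `Replica` class, the `pending` queue and the item name `widget` are all hypothetical.

```python
from collections import deque


class Replica:
    """One copy of the data; names here are illustrative only."""

    def __init__(self, name):
        self.name = name
        self.stock = {}

    def read(self, item):
        return self.stock.get(item, 0)


seattle = Replica("Seattle")
new_york = Replica("New York")
pending = deque()  # updates accepted by one replica, not yet applied to the other


def write(replica, other, item, quantity):
    """Apply the update locally at once; queue it for the other replica."""
    replica.stock[item] = quantity
    pending.append((other, item, quantity))


def replicate():
    """Deliver queued updates; afterwards the replicas agree."""
    while pending:
        target, item, quantity = pending.popleft()
        target.stock[item] = quantity


# Both replicas start with 50 units in stock (assumed starting point).
seattle.stock["widget"] = 50
new_york.stock["widget"] = 50

write(seattle, new_york, "widget", 0)   # stock depleted in Seattle
stale = new_york.read("widget")         # New York has not heard yet
order_accepted = stale >= 50            # the order for 50 pieces goes through

replicate()                             # the replicas converge later
converged = new_york.read("widget")
```

Between `write` and `replicate` the New York replica answers quickly but with stale data, which is acceptable for catalog information and not for inventory.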

CAP Theorem
[Diagram labels: Consistency, Availability, Partition Tolerance; Relational, NoSQL]

Indexing
Most NoSQL databases are indexed by key
Some allow so-called “secondary” indexes
Often the primary key indexes are clustered
HBase uses HDFS (the Hadoop Distributed File System), which is append-only
• Writes are logged
• Logged writes are batched
• File is re-created and sorted
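The three write steps listed for HBase (log the write, batch the logged writes, re-create a sorted file) follow a log-structured pattern. The sketch below imitates that pattern in plain Python and is not HBase's actual implementation; `LogStructuredStore`, `flush_threshold` and the in-memory lists standing in for files are assumptions made for illustration.

```python
class LogStructuredStore:
    """Toy append-only store; lists stand in for files on an append-only file system."""

    def __init__(self, flush_threshold=3):
        self.write_log = []      # step 1: every write is appended to a log
        self.batch = {}          # step 2: logged writes accumulate in a batch
        self.sorted_file = []    # step 3: sorted (key, value) pairs
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.write_log.append((key, value))
        self.batch[key] = value
        if len(self.batch) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Re-create the sorted file by merging the batch into the old contents."""
        merged = dict(self.sorted_file)
        merged.update(self.batch)
        self.sorted_file = sorted(merged.items())
        self.batch = {}

    def get(self, key):
        # Recent writes win over what is already in the sorted file.
        if key in self.batch:
            return self.batch[key]
        return dict(self.sorted_file).get(key)


store = LogStructuredStore(flush_threshold=3)
for key, value in [("row3", "c"), ("row1", "a"), ("row2", "b"), ("row1", "z")]:
    store.put(key, value)
```

Nothing is updated in place: the old sorted file is replaced by a new one, which is what makes the approach workable on an append-only file system.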

Queries
Typically no query language
Instead, create procedural program
Sometimes SQL is supported
Sometimes MapReduce code is used…

MapReduce
This is not Hadoop’s MapReduce, but it’s conceptually related
Map step: pre-processes data
Reduce step: summarizes/aggregates data
Will show a MapReduce code sample for Mongo soon
Will demo map code on CouchDB
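The map and reduce steps can be illustrated conceptually in plain Python. This is not the MongoDB or CouchDB API (both use JavaScript functions); the helper names `map_order`, `reduce_prices` and `map_reduce` are hypothetical, and the third order is made up to give the reduce step something to add.

```python
from collections import defaultdict

# Order documents modeled on the deck's Orders examples; the third is made up.
orders = [
    {"id": 1501, "price": 300, "currency": "USD"},
    {"id": 1502, "price": 2500, "currency": "GBP"},
    {"id": 1503, "price": 200, "currency": "USD"},  # hypothetical extra order
]


def map_order(order):
    """Map step: pre-process one document into (key, value) pairs."""
    yield order["currency"], order["price"]


def reduce_prices(key, values):
    """Reduce step: summarize all values emitted for one key."""
    return sum(values)


def map_reduce(documents, mapper, reducer):
    """Group the mapper's output by key, then reduce each group."""
    grouped = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


totals = map_reduce(orders, map_order, reduce_prices)
```

Here `totals` holds one summed price per currency; in a real store the map calls would typically run near the data and only the emitted pairs would travel.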

Sharding
A partitioning pattern where separate servers store partitions
Fan-out queries supported
Partitions may be duplicated, so replication also provided
• Good for disaster recovery
Since “shards” can be geographically distributed, sharding can act like a CDN
Good for keeping data close to processing
• Reduces network traffic when MapReduce splitting takes place
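A minimal sketch of sharding with a fan-out query, assuming hash-based partitioning (the slide does not say how keys are assigned to shards). `ShardedStore` and its methods are hypothetical, each dictionary stands in for a separate server, and replication of partitions is left out.

```python
import hashlib


class ShardedStore:
    """Toy sharded key-value store; each dict stands in for a separate server."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash, so the same key always maps to the same shard.
        digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, row):
        self._shard_for(key)[key] = row

    def get(self, key):
        # Lookups by key touch exactly one shard.
        return self._shard_for(key).get(key)

    def fan_out(self, predicate):
        # Queries that are not by key go to every shard; results are merged.
        results = []
        for shard in self.shards:
            results.extend(row for row in shard.values() if predicate(row))
        return results


store = ShardedStore(shard_count=3)
store.put(101, {"First_Name": "Andrew", "Last_Name": "Brust"})
store.put(202, {"First_Name": "Jane", "Last_Name": "Doe"})
does = store.fan_out(lambda row: row["Last_Name"] == "Doe")
```

Key lookups stay cheap because they hit one shard, while `fan_out` shows why non-key queries cost more: every shard has to be asked.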

Key-Value Stores
The most common; not necessarily the most popular
Has rows, each with something like a big dictionary/associative array
• Schema may differ from row to row
Common on cloud platforms
• e.g. Amazon SimpleDB, Azure Table Storage
MemcacheDB, Voldemort, Couchbase, DynamoDB (AWS), Dynomite, Redis and Riak

Key-Value Stores
Database
Table: Customers
  Row ID: 101
    First_Name: Andrew
    Last_Name: Brust
    Address: 123 Main Street
    Last_Order: 1501
  Row ID: 202
    First_Name: Jane
    Last_Name: Doe
    Address: 321 Elm Street
    Last_Order: 1502
Table: Orders
  Row ID: 1501
    Price: 300 USD
    Item1: 52134
    Item2: 24457
  Row ID: 1502
    Price: 2500 GBP
    Item1: 98456
    Item2: 59428
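The Customers and Orders tables above can be modeled as nested dictionaries, which is roughly how a key-value store looks to application code. This is a plain-Python sketch, not the API of any listed product; the `Twitter` key and the `last_order_for` helper are hypothetical additions.

```python
# Each table maps a row ID to a dictionary of key-value pairs.
database = {
    "Customers": {
        101: {"First_Name": "Andrew", "Last_Name": "Brust",
              "Address": "123 Main Street", "Last_Order": 1501},
        202: {"First_Name": "Jane", "Last_Name": "Doe",
              "Address": "321 Elm Street", "Last_Order": 1502},
    },
    "Orders": {
        1501: {"Price": "300 USD", "Item1": 52134, "Item2": 24457},
        1502: {"Price": "2500 GBP", "Item1": 98456, "Item2": 59428},
    },
}

# No declared schema: one row can carry a key that other rows lack.
database["Customers"][202]["Twitter"] = "@janedoe"  # hypothetical extra key


def last_order_for(customer_id):
    """A 'query' is procedural code: look up by key, then follow the reference."""
    customer = database["Customers"][customer_id]
    return database["Orders"][customer["Last_Order"]]


order = last_order_for(101)
```

The store itself knows nothing about the link between `Last_Order` and the Orders table; the application code performs what a relational database would express as a join.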

Wide Column Stores
Has tables with declared column families
• Each column family has “columns” which are KV pairs that can vary from row to row
These are the most foundational for large sites
• BigTable (Google)
• HBase (Originally part of Yahoo-dominated Hadoop project)
• Cassandra (Facebook)
• Calls column families “super columns” and tables “super column families”
They are the most “Big Data”-ready
• Especially HBase + Hadoop

Wide Column Stores
Table: Customers
  Row ID: 101
    Super Column: Name
      Column: First_Name: Andrew
      Column: Last_Name: Brust
    Super Column: Address
      Column: Number: 123
      Column: Street: Main Street
    Super Column: Orders
      Column: Last_Order: 1501
  Row ID: 202
    Super Column: Name
      Column: First_Name: Jane
      Column: Last_Name: Doe
    Super Column: Address
      Column: Number: 321
      Column: Street: Elm Street
    Super Column: Orders
      Column: Last_Order: 1502
Table: Orders
  Row ID: 1501
    Super Column: Pricing
      Column: Price: 300 USD
    Super Column: Items
      Column: Item1: 52134
  Row ID: 1502
    Super Column: Pricing
      Column: Price: 2500 GBP
    Super Column: Items
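The Customers table from this slide can be sketched as a three-level mapping: row ID, then column family (labelled “Super Column” on the slide), then column. This is plain Python for illustration, not the HBase or Cassandra API; `get_column` and the `Previous_Order` column are hypothetical.

```python
# row ID -> column family -> column name -> value
customers = {
    101: {
        "Name": {"First_Name": "Andrew", "Last_Name": "Brust"},
        "Address": {"Number": 123, "Street": "Main Street"},
        "Orders": {"Last_Order": 1501},
    },
    202: {
        "Name": {"First_Name": "Jane", "Last_Name": "Doe"},
        "Address": {"Number": 321, "Street": "Elm Street"},
        "Orders": {"Last_Order": 1502},
    },
}


def get_column(table, row_id, family, column, default=None):
    """Read one column; absent rows, families or columns yield the default."""
    return table.get(row_id, {}).get(family, {}).get(column, default)


# The families are fixed per table, but the columns inside them can vary by row.
customers[202]["Orders"]["Previous_Order"] = 1400  # hypothetical extra column

street = get_column(customers, 101, "Address", "Street")
missing = get_column(customers, 101, "Orders", "Previous_Order")
present = get_column(customers, 202, "Orders", "Previous_Order")
```

Row 101 simply has no `Previous_Order` column, so reading it returns the default rather than a stored null.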

Demo
Wide Column Stores

Document Stores
Have “databases,” which are akin to tables
Have “documents,” akin to rows
• Documents are typically JSON objects
• Each document has properties and values
• Values can be scalars, arrays, links to documents in other databases, or sub-documents (i.e. contained JSON objects), which allows for hierarchical storage
• Can have attachments as well
Old versions are retained
• So Doc Stores work well for content management
Some view doc stores as specialized KV stores
Most popular with developers, startups, VCs
The biggies:
• CouchDB
• Derivatives
• MongoDB

Document Store Application Orientation
Documents can each be addressed by URIs
CouchDB supports full REST interface
Very geared towards JavaScript and JSON
• Documents are JSON objects
• CouchDB/MongoDB use JavaScript as native language
In CouchDB, “view functions” also have unique URIs and they return HTML
• So you can build entire applications in the database

Document Stores
Database: Customers
  Document ID: 101
    First_Name: Andrew
    Last_Name: Brust
    Address:
      Number: 123
      Street: Main Street
    Orders:
      Most_recent: 1501
  Document ID: 202
    First_Name: Jane
    Last_Name: Doe
    Address:
      Number: 321
      Street: Elm Street
    Orders:
      Most_recent: 1502
Database: Orders
  Document ID: 1501
    Price: 300 USD
    Item1: 52134
    Item2: 24457
  Document ID: 1502
    Price: 2500 GBP
    Item1: 98456
    Item2: 59428
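The customer document from this slide, with its nested Address and Orders sub-documents, can be written as a JSON-compatible Python dictionary. The sketch uses only the standard `json` module and plain dictionaries in place of real databases; it is not the CouchDB or MongoDB client API, and the variable names are hypothetical.

```python
import json

# Two "databases", each mapping a document ID to a JSON-style document.
customers = {
    "101": {
        "First_Name": "Andrew",
        "Last_Name": "Brust",
        "Address": {"Number": 123, "Street": "Main Street"},  # sub-document
        "Orders": {"Most_recent": 1501},                      # link by document ID
    },
}
orders = {
    "1501": {"Price": "300 USD", "Item1": 52134, "Item2": 24457},
}

# Documents travel as JSON text; the nesting survives the round trip.
wire_format = json.dumps(customers["101"])
document = json.loads(wire_format)

street = document["Address"]["Street"]
linked_order = orders[str(document["Orders"]["Most_recent"])]
```

The hierarchy lives inside the document itself, while the reference to the Orders database is just a stored ID that the application resolves.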

Demo
Document Stores

Graph Databases
Great for social network applications and others where relationships are important
Nodes and edges
• Edge like a join
• Nodes like rows in a table
Nodes can also have properties and values
Neo4j is a popular graph db

Graph Databases
[Diagram: a graph database of nodes joined by labeled edges]
Person nodes: Joe Smith, Jane Doe, Andrew Brust, George Washington
Edge labels: Sent invitation to, Commented on photo by, Friend of, Address, Placed order, Item 1, Item 2
Nodes with properties:
• Street: 123 Main Street; City: New York; State: NY; Zip: 10014
• ID: 52134; Type: Dress; Color: Blue
• ID: 24457; Type: Shirt; Color: Red
• ID: 252; Total Price: 300 USD
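Nodes, edges and a traversal can be sketched in plain Python. The node names and properties are taken from the slide, but which nodes each edge connects is assumed here for illustration; this is not Neo4j's API, and `follow` is a hypothetical helper.

```python
from collections import defaultdict

# Nodes carry properties, much like rows in a table.
nodes = {
    "andrew": {"name": "Andrew Brust"},
    "jane": {"name": "Jane Doe"},
    "joe": {"name": "Joe Smith"},
    "order252": {"ID": 252, "Total Price": "300 USD"},
    "item52134": {"ID": 52134, "Type": "Dress", "Color": "Blue"},
}

# (source, label, target) triples; these connections are assumed, not from the slide.
edges = [
    ("andrew", "Friend of", "jane"),
    ("joe", "Sent invitation to", "andrew"),
    ("andrew", "Placed order", "order252"),
    ("order252", "Item 1", "item52134"),
]

outgoing = defaultdict(list)
for source, label, target in edges:
    outgoing[source].append((label, target))


def follow(start, *labels):
    """Traverse from a node along a sequence of edge labels."""
    current = [start]
    for label in labels:
        current = [target
                   for node in current
                   for edge_label, target in outgoing[node]
                   if edge_label == label]
    return current


ordered_items = follow("andrew", "Placed order", "Item 1")
item_types = [nodes[node_id]["Type"] for node_id in ordered_items]
```

Each hop in `follow` plays the role of a join, which is why relationship-heavy queries are natural to express against a graph.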

PROVISIONING, MARKET, APPLICABILITY

NoSQL + BI
NoSQL databases are bad for ad hoc query and data warehousing
BI applications involve models; models rely on schema
Extract, transform and load (ETL) may be your friend
Wide-column stores, however, are good for “Big Data”
• See next slide
Wide-column stores and column-oriented databases are similar technologically

NoSQL + Big Data
Big Data and NoSQL are interrelated
Typically, wide-column stores are used in Big Data scenarios
Prime example:
• HBase and Hadoop
Why?
• Lack of indexing not a problem
• Consistency not an issue
• Fast reads very important
• Distributed file systems important too
• Commodity hardware and disk assumptions also important
• Not Web scale but massive scale-out, so similar concerns

Going “NoSQL-Like” on the MS Cloud
Azure Table Storage (a key-value store)
SQL Azure XML columns (supports variable schema, hierarchy)
SQL Azure Federation (a sharding implementation)
OData (HTTP/JSON data APIs)
Running NoSQL database products using Azure VMs…

NoSQL on Windows Azure
Platform as a Service
• Cloudant: https://cloudant.com/azure/
• MongoDB (via MongoLab): http://blog.mongolab.com/2012/10/azure/
MongoDB, DIY:
• On an Azure Worker Role:
http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles
• On a Windows VM:
http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer
• On a Linux VM:
http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorial
http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/

NoSQL on Windows Azure
Others, DIY (Linux VMs):
• Couchbase:
http://blog.couchbase.com/couchbase-server-new-windows-azure
• CouchDB:
http://ossonazure.interoperabilitybridges.com/articles/couchdb-installer-for-windows-azure
• Riak:
http://basho.com/blog/technical/2012/10/09/Riak-on-Microsoft-Azure/
• Redis:
http://blogs.msdn.com/b/tconte/archive/2012/06/08/running-redis-on-a-centos-linux-vm-in-windows-azure.aspx
• Cassandra:
http://www.windowsazure.com/en-us/manage/linux/other-resources/how-to-run-cassandra-with-linux/

And With MS On-Premise Technologies
SQL Server 2008/2008R2/2012 “Beyond Relational” Features
• Sparse columns (like Wide Column Stores)
• Geospatial (geometry, geography data types)
• FILESTREAM, FileTable (like Document Store attachments)
• Full Text Search, Semantic Similarity Search
• HierarchyID (can simulate Graph Database functionality)
SQL Server Parallel Data Warehouse Edition (PDW)
• Distributed architecture (like MapReduce/Hadoop)
• PolyBase in PDW v2 (interfaces PDW and HDFS)

Compromises
Eventual consistency
Write buffering
Only primary keys can be indexed
Queries must be written as programs
Tooling
• Productivity (= money)

Summing Up
• Line of Business -> Relational
• Large, public (consumer)-facing sites -> NoSQL
• Complex data structures -> Relational
• Big Data -> NoSQL
• Transactional -> Relational
• Content Management -> NoSQL
• Enterprise -> Relational
• Consumer Web -> NoSQL

Thank you
• andrew.brust@bluebadgeinsights.com
• @andrewbrust on twitter
• Want to get on Blue Badge Insights’ list? Text “bluebadge” to 22828
