3. Why Change? -> “Small” Big Data
• Your data - BEHAVIORAL
• Your data - TRANSACTIONAL
• PUBLIC data
• PREMIUM data
4. Current Data Questions
• “Should we evaluate Hadoop?”
• “How much data is Big Data?”
• “What are the limits of SQL Server?”
• “Which NoSQL databases (if any) should we consider?”
• “How safe is the cloud really?”
• “How do we mine the data for usable information?”
6. DEMO - About Open Source
• Free
– Rapid iteration, innovation
– Can start up for free (on premise)
– Can ‘rent’ for cheap or free on the cloud
– Can use with the command line for free
– Some vendors offer free online training
– Ex. www.neo4j.org
• Not Free
– Constant releases
– Can be deceptively hard to set up (time is money)
– Don’t forget to turn it off if on the cloud!
– GUI tools, support, training cost $$$
– Ex. www.neo4j.com
7. Database Choices – The first level of choice
• Data
– A. Hadoop
– B. NoSQL
– C. Relational
• On Premise or In the Cloud
10. How you ‘get’ Hadoop
A. Open source
• Roll your own
B. Commercial distribution
• Cloudera
• MapR
• Hortonworks
• More…
C. Rent it via the cloud
• AWS
• HDInsight
14. Example Comparison: RDBMS vs. Hadoop
                     Traditional RDBMS        Hadoop / MapReduce
Data Size            Gigabytes (Terabytes)    Petabytes and greater
Access               Interactive and Batch    Batch – NOT Interactive
Updates              Read / Write many times  Write once, Read many times
Structure            Static Schema            Dynamic Schema
Integrity            High (ACID)              Low
Scaling              Nonlinear                Linear
Query Response Time  Can be near immediate    Has latency (due to batch processing)
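To make the "Batch" and "Write once, Read many times" rows concrete, here is a minimal single-machine sketch of the MapReduce pattern (map, shuffle, reduce) applied to a word count. It is only an illustration of the programming model, not Hadoop's actual API; the function names and the sample input are assumptions made for this example.

```python
from collections import defaultdict

def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(lines):
    """Run the whole batch: the input is read once and never updated."""
    return reduce_phase(shuffle(map_phase(lines)))

if __name__ == "__main__":
    # Hypothetical behavioral log lines, invented for illustration.
    sample_lines = ["click view click", "view purchase"]
    print(word_count(sample_lines))
```

Nothing is returned until the whole input has passed through all three steps, which is why the table lists Hadoop as batch rather than interactive.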
16. An Aside…SQL Server 2012++ ‘NoSQL’
• SQL Server 2012 Columnstore Index
• SQL Server 2012 Tabular Model (SSAS)
                                        2012  2014
SSAS Tabular Models                     X     X
NC Columnstore Index                    X     X
Clustered (writable) Columnstore Index        X
In-memory OLTP                                X
17. But wait…
is there a
RELATIONAL database
that scales,
that is cheap,
that runs in the cloud?
18. DEMO - AWS Redshift
• About $1k per Terabyte per year - relational
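As a rough back-of-the-envelope aid, the sketch below turns the "about $1k per terabyte per year" figure from this slide into yearly and monthly estimates. The rate is only the slide's approximate number, not current vendor pricing, and the function names are hypothetical.

```python
# Assumption: the slide's approximate figure, not a quoted price.
ASSUMED_USD_PER_TB_YEAR = 1_000.0

def yearly_cost_usd(terabytes, usd_per_tb_year=ASSUMED_USD_PER_TB_YEAR):
    """Estimate the yearly cost of storing `terabytes` at a flat per-TB rate."""
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    return terabytes * usd_per_tb_year

def monthly_cost_usd(terabytes, usd_per_tb_year=ASSUMED_USD_PER_TB_YEAR):
    """Same estimate, spread evenly over twelve months."""
    return yearly_cost_usd(terabytes, usd_per_tb_year) / 12

if __name__ == "__main__":
    print(yearly_cost_usd(5))    # 5 TB at the assumed rate
    print(monthly_cost_usd(12))  # 12 TB at the assumed rate, per month
```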
19. So many NoSQL options
• More than just the Elephant in the room
• More than 150 types of NoSQL databases
31. Graph Databases
• A lot of many-to-many relationships
• Recursive self-joins
• When your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data
• Examples:
– Neo4j
– AlgebraixData
– Google Freebase
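To illustrate the kind of "finding connections" query listed above, here is a minimal sketch of a breadth-first search over an in-memory adjacency list. In a relational database the same question would typically take one self-join per hop. This is plain Python, not the query language of Neo4j or any other product named here; the `follows` data and the function name are invented for the example.

```python
from collections import deque

def shortest_connection(graph, start, goal):
    """Return the shortest path from start to goal as a list, or None."""
    if start == goal:
        return [start]
    visited = {start}
    queue = deque([[start]])  # each queue entry is a path explored so far
    while queue:
        path = queue.popleft()
        for neighbor in graph.get(path[-1], ()):
            if neighbor in visited:
                continue
            if neighbor == goal:
                return path + [neighbor]
            visited.add(neighbor)
            queue.append(path + [neighbor])
    return None  # no connection exists

if __name__ == "__main__":
    # Hypothetical "who follows whom" relationships.
    follows = {
        "ann": ["bob", "cat"],
        "bob": ["dan"],
        "cat": ["dan"],
        "dan": ["eve"],
    }
    print(shortest_connection(follows, "ann", "eve"))
```

Each additional hop is just one more loop iteration here, whereas a recursive self-join grows more expensive with every level.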
36. Cloud Offerings– RDBMS AND NoSQL
                         AWS                            Google                               Microsoft
Managed RDBMS            RDS – all major RDBMS          Cloud SQL                            SQL Azure
NoSQL buckets            S3 or Glacier                  Cloud Storage                        Azure Blobs
NoSQL Key-Value          DynamoDB                       Cloud Datastore                      Azure Tables
Streaming or ML          Kinesis                        Prospective Search & Prediction API  StreamInsight
NoSQL Document or Graph  MongoDB on EC2 / Neo4j on EC2  None / Freebase                      MongoDB on Microsoft Cloud / Neo4j on Microsoft Cloud
Hadoop (HBase)           Elastic MapReduce (S3 & EC2)   None                                 HDInsight
Dremel/Warehousing       Redshift                       BigQuery                             None
Cloud ETL                Data Pipelines                 None                                 None
41. NoSQL To-Do List
Understand types of NoSQL databases
• Use NoSQL when business needs dictate
• Use the right type of NoSQL for your business problem
Try out NoSQL on the cloud
• Quick and cheap for behavioral data
• Mashup cloud datasets
• Good for specialized use cases, e.g. dev, test, training environments
Learn NoSQL access technologies & services
• New query languages, e.g. MapReduce, R, Infer.NET
• New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc.
• Windows Azure Data Market, other public data markets
http://hortonworks.com/technology/hortonworksdataplatform/
More about HBase, from the O’Reilly ‘Getting Ready for BigData’ report
“Enter HBase, a column-oriented database that runs on top of HDFS. Modeled after Google’s BigTable, the project’s goal is to host billions of rows of data for rapid access. MapReduce can use HBase as both a source and a destination for its computations, and Hive and Pig can be used in combination with HBase.
In order to grant random access to the data, HBase does impose a few restrictions: performance with Hive is 4-5 times slower than plain HDFS, and the maximum amount of data you can store is approximately a petabyte, versus HDFS’ limit of over 30PB.”
http://www.cloudera.com/
Original Reference: Tom White’s Hadoop: The Definitive Guide (I made some modifications based on my experience)
http://nosql-database.org/
http://hadoop.apache.org/ & http://www.mongodb.org/
Wikipedia - http://en.wikipedia.org/wiki/NoSQL
List of noSQL databases – http://nosql-database.org/
The good, the bad - http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772
http://code.google.com
Access via REST APIs
Very Cheap, but not much functionality included
Lots of code to write for application development
But…can be a good backup solution
http://www.infinitegraph.com/what-is-a-graph-database.html and http://www.neo4j.org/
http://en.wikipedia.org/wiki/Graph_database
http://www.freebase.com/
http://www.neo4j.org/learn/try
For Google - http://code.google.com
For AWS - https://console.aws.amazon.com/console/home
Hadoop on AWS - http://wiki.apache.org/hadoop/AmazonEC2