Big Data and NoSQL in Microsoft-Land
- 1. SQL Server Live! Orlando 2012
Big Data and NoSQL
in Microsoft-Land
Andrew Brust and Lynn Langit
Blue Badge Insights & Data Wrangler
Level: Intermediate
Meet Andrew
• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 17 years as a speaker
• Founder, Microsoft BI User Group of NYC
– http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group
– http://www.nycdotnetdev.com
• “Redmond Review” columnist for
Visual Studio Magazine and Redmond Developer
News
• brustblog.com, Twitter: @andrewbrust
SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 1
- 2. SQL Server Live! Orlando 2012
Andrew’s New Blog (bit.ly/bigondata)
Meet Lynn
• CEO and Founder, Lynn Langit consulting
• Former Microsoft Evangelist (4 years)
• Google Developer Expert
• MongoDB Master
• MCT 13 years – 7 certifications
• Cloudera Certified Developer
• MSDN Magazine articles
– SQL Azure
– Hadoop on Azure
– MongoDB on Azure
• www.LynnLangit.com
• @LynnLangit
- 3. SQL Server Live! Orlando 2012
Lynn’s YouTube Channel
www.TeachingKidsProgramming.org
• Free Courseware (recipes)
• Do a Recipe, Teach a Kid (Ages 10+)
• Java or Microsoft SmallBasic
- 4. SQL Server Live! Orlando 2012
Read all about it!
Agenda
• Overview / Landscape
– Big Data, and Hadoop
– NoSQL
– The Big Data-NoSQL Intersection
• Drilldown on Big Data
• Drilldown on NoSQL
- 5. SQL Server Live! Orlando 2012
What is Big Data?
• 100s of TB into PB and higher
• Involving data from: financial data,
sensors, web logs, social media, etc.
• Parallel processing often involved
– Hadoop is emblematic, but other technologies are Big
Data too
• Processing of data sets too large for
transactional databases
– Analyzing interactions, rather than transactions
– The three V’s: Volume, Velocity, Variety
• Big Data tech sometimes imposed on
small data problems
BigData = Exponentially More Data
• Retail Example -> ‘Feedback Economy’
– Number of transactions
– Number of behaviors (collected every minute)
- 6. SQL Server Live! Orlando 2012
BigData = ‘Next State’ Questions
Collecting behavioral data
• What could happen?
• Why didn’t this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What happens?
What’s MapReduce?
• “Big” input data as key-value pair series
• Partition the data and send to mappers
(nodes in cluster)
• Mappers pre-process, put into key-value
format, and send all output for a given (set
of) key(s) to a reducer
• Reducer aggregates; one output per key,
with value
• Map and Reduce code natively written as
Java functions
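The steps above can be traced in a few lines of single-process Python. This is only a sketch of the idea, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, run_job) and the word-count task are assumptions chosen for illustration, and a real cluster would run the mappers and reducers on separate nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Mapper: turn one input record into (key, value) pairs."""
    for word in record.lower().split():
        yield word, 1


def shuffle(mapped_pairs):
    """Group all values for the same key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: aggregate the values for one key into a single output."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    lines = ["big data big cluster", "data moves to reducers"]
    print(run_job(lines))
```

Each key ends up with exactly one output value, which matches the "one output per key" bullet above.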
- 7. SQL Server Live! Orlando 2012
MapReduce, in a Diagram
[Diagram: input splits feed parallel mappers; mapper output is grouped by key (K1, K2, K3) and routed as input to reducers, each of which produces output.]
A MapReduce Example
• Count by suite, on each floor
• Send per-suite, per platform totals to lobby
• Sort totals by platform
• Send two platform packets to 10th, 20th, 30th floor
• Tally up each platform
• Collect the tallies
• Merge tallies into one spreadsheet
- 8. SQL Server Live! Orlando 2012
What’s a Distributed File System?
• One where data gets distributed over
commodity drives on commodity servers
• Data is replicated
• If one box goes down, no data lost
– “Shared Nothing”
• BUT: Immutable
– Files can only be written to once
– So updates require drop + re-write (slow)
– You can append though
– Like a DVD/CD-ROM
Hadoop = MapReduce + HDFS
• Modeled after Google MapReduce + GFS
• Have more data? Just add more nodes to
cluster.
– Mappers execute in parallel
– Hardware is commodity
– “Scaling out”
• Use of HDFS means data may well be local
to mapper processing
• So, not just parallel, but minimal data
movement, which avoids network
bottlenecks
- 9. SQL Server Live! Orlando 2012
Example Comparison: RDBMS vs. Hadoop
Attribute | Traditional RDBMS | Hadoop / MapReduce
Data Size | Gigabytes (Terabytes) | Petabytes (Exabytes)
Access | Interactive and Batch | Batch – NOT Interactive
Updates | Read / Write many times | Write once, Read many times
Structure | Static Schema | Dynamic Schema
Integrity | High (ACID) | Low
Scaling | Nonlinear | Linear
Query Response Time | Can be near immediate | Has latency (due to batch processing)
Just-in-time Schema
• When looking at unstructured data,
schema is imposed at query time
• Schema is context specific
– If scanning a book, are the values words, lines, or
pages?
– Are notes a single field, or is each word a value?
– Are date and time two fields or one?
– Are street, city, state, zip separate or one value?
– Pig and Hive let you determine this at query time
– So does the Map function in MapReduce code
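To make the just-in-time schema idea concrete, here is a small Python sketch that reads the same raw line under two different schemas. The sample line, the "|" delimiter and both function names are assumptions made up for this illustration; nothing here is Pig, Hive or Hadoop code.

```python
from datetime import datetime

# Hypothetical raw record: the stored bytes carry no schema of their own.
RAW_LINE = "2012-12-10 14:05:00|221B Baker Street, Orlando, FL, 32801"


def read_as_single_fields(line):
    """Schema 1: date and time are one field, the address is one value."""
    stamp, address = line.split("|")
    return {
        "timestamp": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
        "address": address,
    }


def read_as_split_fields(line):
    """Schema 2: date and time are two fields, the address is four values."""
    stamp, address = line.split("|")
    date_part, time_part = stamp.split(" ")
    street, city, state, zip_code = [part.strip() for part in address.split(",")]
    return {
        "date": date_part,
        "time": time_part,
        "street": street,
        "city": city,
        "state": state,
        "zip": zip_code,
    }


if __name__ == "__main__":
    print(read_as_single_fields(RAW_LINE))
    print(read_as_split_fields(RAW_LINE))
```

Neither reading changes the stored data; the schema lives entirely in the code that runs at query time.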
- 10. SQL Server Live! Orlando 2012
What’s HBase?
• A Wide-Column Store NoSQL database
• Modeled after Google BigTable
• Uses HDFS
– Therefore, Hadoop-compatible
• Hadoop often used with HBase
– But you can use either without the other
- 11. SQL Server Live! Orlando 2012
NoSQL Confusion
• Many ‘flavors’ of NoSQL data stores
• Easiest to group by functionality, but…
– Dividing lines are not clear or consistent
• NoSQL choice(s) driven by many factors
– Type of data
– Quantity of data
– Knowledge of technical staff
– Product maturity
– Tooling
So much wrong information
• Everything is ‘new’
• People are religious about data storage
• Lots of incorrect information
• ‘Try’ before you ‘buy’ (or use)
• Watch out for oversimplification
• Confusion over vendor offerings
- 12. SQL Server Live! Orlando 2012
Common NoSQL Misconceptions
Problems
• Everything is ‘new’
• People are religious about data storage
• Open source is always cheaper
• Cloud is always cheaper
• Replace RDBMS with NoSQL
Solutions
• ‘Try’ before you ‘buy’ (or use)
• Leverage NoSQL communities
• Add NoSQL to existing RDBMS solution
NoSQL + Big Data
• HBase and Cassandra work with Hadoop, are
NoSQL databases
• MongoDB brands itself a Big Data technology
• Couchbase does too
• Just-in-time schema
• MapReduce in MongoDB, others
• Hadoop and most NoSQL DBs are
partitioned, scale-out technologies
• It’s all about analytics on semi- or un-
structured data
- 13. SQL Server Live! Orlando 2012
DRILLDOWN ON BIG DATA
The Hadoop Stack
Log file integration
Machine Learning/Data Mining
RDBMS Import/Export
Query: HiveQL and Pig Latin
Database
MapReduce, HDFS
- 14. SQL Server Live! Orlando 2012
What’s Hive?
• Began as Hadoop sub-project
– Now top-level Apache project
• Provides a SQL-like (“HiveQL”)
abstraction over MapReduce
• Has its own HDFS table file format (and
it’s fully schema-bound)
• Can also work over HBase
• Acts as a bridge to many BI products
which expect tabular data
Hadoop Distributions
• Cloudera
• Hortonworks
– HCatalog: Hive/Pig/MR Interop
• MapR
– Network File System replaces HDFS
• IBM InfoSphere BigInsights
– HDFS<->DB2 integration
• And now Microsoft…
- 15. SQL Server Live! Orlando 2012
Microsoft HDInsight
• Developed with Hortonworks and
incorporates Hortonworks Data Platform
(HDP) for Windows
• Windows Azure HDInsight and Microsoft
HDInsight (for Windows Server)
– Single node preview runs on Windows client
• Includes ODBC Driver for Hive
– And Excel Add-In that uses it
• JavaScript MapReduce framework
• Contribute it all back to open source
Apache Project
Amenities for Visual Studio/.NET
• Hortonworks Data Platform for Windows
• MRLib (NuGet Package)
• MR code in C#: HadoopJob, MapperBase, ReducerBase
• LINQ to Hive
• OdbcClient + Hive ODBC Driver
• Debugging
• Deployment
- 16. SQL Server Live! Orlando 2012
Some ways to work
• Microsoft HDInsight
– Cloud: go to www.hadooponazure.com, request invite
– Local: Download Microsoft HDInsight
Runs on just about anything, including Windows XP
Get it via the Web Platform installer (WebPI)
– Both are free for now; Azure HDInsight will be fee-based when
RTM
• Amazon Web Services Elastic MapReduce
– Create AWS account
– Select Elastic MapReduce in Dashboard
– Cheap for experimenting, but not free
• Cloudera CDH VM image
– Download as .tar.gz file
– “Un-tar” (can use WinRAR, 7zip)
– Run via VMWare Player or Virtual Box
– Everything’s free
Some ways to work
HDInsight EMR CDH 4
- 17. SQL Server Live! Orlando 2012
Microsoft HDInsight
• Much simpler than the others
• Browser-based portal
– Launch MapReduce jobs
– Azure: Provisioning cluster, managing ports, gathering external
data
• Interactive JavaScript & Hive console
– JS: HDFS, Pig, light data visualization
– Hive commands and metadata discovery
– New console coming
• Desktop Shortcuts:
– Command window, MapReduce, Name Node status in
browser
– Azure: from portal page you can RDP directly to Hadoop
head node for these desktop shortcuts
Windows Azure
HDInsight
- 18. SQL Server Live! Orlando 2012
Amazon Elastic MapReduce
• Lots of steps!
• At a high level:
– Set up AWS account and S3 “buckets”
– Generate Key Pair and PEM file
– Install Ruby and EMR Command Line Interface
– Provision the cluster using CLI
A batch file can work very well here
– Set up and run SSH/PuTTY
– Work interactively at command line
Amazon EMR – Prep Steps
• Create an AWS account
• Create an S3 bucket for log storage
– with list permissions for authenticated users
• Create a Key Pair and save PEM file
• Install Ruby
• Install Amazon Web Services Elastic
MapReduce Command Line Interface
– aka AWS EMR CLI
• Create credentials.json in EMR CLI folder
– Associate with same region as where key pair created
- 19. SQL Server Live! Orlando 2012
Amazon – Security and Startup
• Security
– Download PuTTYgen and run it
– Click Load and browse to PEM file
– Save it in PPK format
– Exit PuTTYgen
• In a command window, navigate to EMR CLI
folder and enter command:
– ruby elastic-mapreduce --create --alive [--num-instances xx]
[--pig-interactive] [--hive-interactive] [--hbase --instance-type
m1.large]
• In AWS Console, go to EC2 Dashboard and
click Instances on left nav bar
• Wait until instance is running and get its
Public DNS name
– Use Compatibility View in IE or copy may not work
Connect!
• Download and run PuTTY
• Paste DNS name of EC2 instance into hostname
field
• In Treeview, drill down and navigate to
Connection > SSH > Auth, browse to PPK file
• Once EC2 instance(s) running, click Open
• Click Yes to “The server’s host key is not cached
in the registry…” PuTTY Security Alert
• When prompted for user name, type “hadoop” and
hit Enter
• cd bin, then hive, pig, hbase shell
• Right-click to paste from clipboard; option to go
full-screen
• (Kill EC2 instance(s) from Dashboard when done)
- 20. SQL Server Live! Orlando 2012
Amazon Elastic MapReduce
Cloudera CDH4 Virtual Machine
• Get it for free, in VMWare and Virtual Box
versions.
– VMWare player and Virtual Box are free too
• Run it, and configure it to have its own IP on
your network. Use ifconfig to discover IP.
• Assuming IP of 192.168.1.59, open browser on
your own (host) machine and navigate to:
– http://192.168.1.59:8888
• Can also use browser in VM and hit:
– http://localhost:8888
• Work in “Hue”…
- 21. SQL Server Live! Orlando 2012
Hue
• Browser based UI,
with front ends
for:
– HDFS (w/ upload &
download)
– MapReduce job
creation and
monitoring
– Hive (“Beeswax”)
• And in-browser
command line
shells for:
– HBase
– Pig (“Grunt”)
Impala: What it Is
• Distributed SQL query engine over
Hadoop cluster
• Announced at Strata/Hadoop World in NYC
on October 24th
• In Beta, as part of CDH 4.1
• Works with HDFS and Hive data
• Compatible with HiveQL and Hive drivers
– Query with Beeswax
- 22. SQL Server Live! Orlando 2012
Impala: What it’s Not
• Impala is not Hive
– Hive converts HiveQL to Java MapReduce code and
executes it in batch mode
– Impala executes query interactively over the data
– Brings BI tools and Hadoop closer together
• Impala is not an Apache Software
Foundation project
– It is open source and Apache-licensed, but
it’s still incubated by Cloudera
– Only in CDH
Cloudera CDH4
- 23. SQL Server Live! Orlando 2012
Hadoop commands
• HDFS
– hadoop fs -filecommand
– Create and remove directories:
mkdir, rm, rmr
– Upload and download files to/from HDFS
get, put
– View directory contents
ls, lsr
– Copy, move, view files
cp, mv, cat
• MapReduce
– Run a Java jar-file based job
hadoop jar jarname params
Hadoop (directly)
- 24. SQL Server Live! Orlando 2012
HBase
• Concepts:
– Tables, column families
– Columns, rows
– Keys, values
• Commands:
– Definition: create, alter, drop, truncate
– Manipulation: get, put, delete, deleteall, scan
– Discovery: list, exists, describe, count
– Enablement: disable, enable
– Utilities: version, status, shutdown, exit
– Reference: http://wiki.apache.org/hadoop/Hbase/Shell
• Moreover,
– Interesting HBase work can be done in MapReduce, Pig
HBase Examples
• create 't1', 'f1', 'f2', 'f3'
• describe 't1'
• alter 't1', {NAME => 'f1',
VERSIONS => 5}
• put 't1', 'r1', 'f1:c1', 'value'
• get 't1', 'r1'
• count 't1'
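The concepts behind these shell commands (tables, column families, rows, versioned values) can be modeled with a toy in-memory class. This is a sketch for intuition only: ToyTable and its methods are hypothetical names, not the HBase client API, and real HBase persists data on HDFS and timestamps every version.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase table; not the HBase client API."""

    def __init__(self, name, *families, max_versions=1):
        self.name = name
        self.families = set(families)
        self.max_versions = max_versions
        # row key -> "family:qualifier" -> list of values, newest first
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store a value; the column must be written as family:qualifier."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        versions = self.rows[row_key].setdefault(column, [])
        versions.insert(0, value)
        del versions[self.max_versions:]  # keep only the newest versions

    def get(self, row_key):
        """Return the newest value of every column in the row."""
        return {col: vals[0] for col, vals in self.rows.get(row_key, {}).items()}

    def count(self):
        """Number of rows that hold at least one value."""
        return len(self.rows)


if __name__ == "__main__":
    t1 = ToyTable("t1", "f1", "f2", "f3", max_versions=5)
    t1.put("r1", "f1:c1", "value")
    t1.put("r1", "f1:c1", "newer value")
    print(t1.get("r1"), t1.count())
```

Column families are fixed when the table is created, while qualifiers such as c1 can be invented per row, which is what makes the store wide and sparse.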
- 25. SQL Server Live! Orlando 2012
HBase
Submitting, Running and
Monitoring Jobs
• Upload a JAR
• Use Streaming
– Use other languages (i.e. other than Java) to write
MapReduce code
– Python is a popular option
– Any executable works, even C# console apps
– On MS HDInsight, JavaScript works too
– Still uses a JAR file: streaming.jar
• Run at command line (passing JAR name
and params) or use GUI
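As a rough sketch of what streaming code looks like in Python: the mapper and reducer exchange plain text lines of the form key, tab, value. The word-count task and the function names are assumptions for illustration. In an actual streaming job each function would live in its own script that reads standard input and prints to standard output; here the sort between the two steps is simulated in-process.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; expects its input lines already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


if __name__ == "__main__":
    sample = ["big data", "big cluster"]
    shuffled = sorted(mapper(sample))  # stands in for the sort between map and reduce
    for line in reducer(shuffled):
        print(line)
```

Because the contract is just lines of text, the same pattern works for any executable, which is why the slide notes that even C# console apps qualify.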
- 26. SQL Server Live! Orlando 2012
Running MapReduce
Jobs
Hive
• Used by most BI products which connect
to Hadoop
• Provides a SQL-like abstraction over
Hadoop
– Officially HiveQL, or HQL
• Works on own tables, but also on HBase
• Query generates MapReduce job, output of
which becomes result set
• Microsoft has Hive ODBC driver
– Connects Excel, Reporting Services, PowerPivot,
Analysis Services Tabular Mode (only)
- 27. SQL Server Live! Orlando 2012
Hive, Continued
• Load data from flat HDFS files
– LOAD DATA [LOCAL] INPATH 'myfile'
INTO TABLE mytable;
• SQL Queries
– CREATE, ALTER, DROP
– INSERT OVERWRITE (creates whole tables)
– SELECT, JOIN, WHERE, GROUP BY
– SORT BY, but ordering data is tricky!
– MAP/REDUCE/TRANSFORM…USING allows for custom
map, reduce steps utilizing Java or streaming code
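One reason ordering is tricky: in Hive, SORT BY orders rows within each reducer, while ORDER BY yields a single total order by sending every row through one reducer. The Python sketch below imitates that difference. The modulo-based partition function is an assumption made for the demo, not how Hive actually distributes rows.

```python
def partition(rows, num_reducers):
    """Assign integer rows to reducers by value modulo reducer count (demo only)."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[row % num_reducers].append(row)
    return buckets


def sort_by(rows, num_reducers):
    """SORT BY: each reducer sorts its own rows; outputs are concatenated."""
    return [value for bucket in partition(rows, num_reducers) for value in sorted(bucket)]


def order_by(rows):
    """ORDER BY: one reducer sees every row, so the result is totally ordered."""
    return sorted(rows)


if __name__ == "__main__":
    data = [5, 2, 8, 1, 4, 7]
    print(sort_by(data, 2))   # sorted within each reducer only
    print(order_by(data))     # sorted overall
```

With more than one reducer, the SORT BY result is a series of sorted runs rather than one sorted list, which surprises people coming from a relational database.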
Excel Add-In for Hive
- 28. SQL Server Live! Orlando 2012
Hive
Pig
• Instead of SQL, employs a language (“Pig
Latin”) that accommodates data flow
expressions
– Do a combo of Query and ETL
• “10 lines of Pig Latin ≈ 200 lines of Java.”
• Works with structured or unstructured data
• Operations
– As with Hive, a MapReduce job is generated
– Unlike Hive, output is only flat file to HDFS or text at
command line console
– With MS Hadoop, can easily convert to JavaScript array,
then manipulate
• Use command line (“Grunt”) or build scripts
- 29. SQL Server Live! Orlando 2012
Example
• A = LOAD 'myfile'
AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE
group, COUNT(B);
STORE D INTO 'output';
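For readers new to data flow languages, the load, filter, group and count steps of this example can be traced in plain Python. This is only an analogy: the function names and the sample tuples are assumptions, and Pig would compile the script to MapReduce jobs rather than run it in one process.

```python
from collections import defaultdict


def load(rows):
    """LOAD ... AS (x, y, z): attach field names to raw tuples."""
    return [dict(zip(("x", "y", "z"), row)) for row in rows]


def filter_positive(records):
    """FILTER ... BY x > 0."""
    return [record for record in records if record["x"] > 0]


def group_by_x(records):
    """GROUP ... BY x: one bag of records per distinct x."""
    groups = defaultdict(list)
    for record in records:
        groups[record["x"]].append(record)
    return groups


def count_per_group(groups):
    """Emit (group key, size of bag) pairs, sorted by key for stable output."""
    return sorted((key, len(bag)) for key, bag in groups.items())


raw = [(1, "a", "p"), (1, "b", "q"), (-3, "c", "r"), (2, "d", "s")]
result = count_per_group(group_by_x(filter_positive(load(raw))))

if __name__ == "__main__":
    print(result)
```

Each Python function corresponds to one assignment statement in the script, which is the sense in which Pig Latin expresses a data flow.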
Pig Latin Examples
• Imperative, file system commands
– LOAD, STORE
Schema specified on LOAD
• Declarative, query commands (SQL-like)
– xxx = file or data set
– FOREACH xxx GENERATE (SELECT…FROM xxx)
– JOIN (WHERE/INNER JOIN)
– FILTER xxx BY (WHERE)
– ORDER xxx BY (ORDER BY)
– GROUP xxx BY / GENERATE COUNT(xxx)
(SELECT COUNT(*) GROUP BY)
– DISTINCT (SELECT DISTINCT)
• Syntax is assignment statement-based:
– MyCusts = FILTER Custs BY SalesPerson == 15;
• Access Hbase
– CpuMetrics = LOAD 'hbase://SystemMetrics' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:',
'-loadKey -returnTuple');
- 30. SQL Server Live! Orlando 2012
Pig
Sqoop
sqoop import
--connect
"jdbc:sqlserver://<servername>.
database.windows.net:1433;
database=<dbname>;
user=<username>@<servername>;
password=<password>"
--table <from_table>
--target-dir <to_hdfs_folder>
--split-by <from_table_column>
- 31. SQL Server Live! Orlando 2012
Sqoop
sqoop export
--connect
"jdbc:sqlserver://<servername>.
database.windows.net:1433;
database=<dbname>;
user=<username>@<servername>;
password=<password>"
--table <to_table>
--export-dir <from_hdfs_folder>
--input-fields-terminated-by
"<delimiter>"
Flume NG
• Source
– Avro (data serialization system – can read json-
encoded data files, and can work over RPC)
– Exec (reads from stdout of long-running process)
• Sinks
– HDFS, HBase, Avro
• Channels
– Memory, JDBC, file
- 32. SQL Server Live! Orlando 2012
Flume NG (next generation)
• Set up conf/flume.conf
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414
# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger
# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1
• From the command line:
flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
Mahout Algorithms
• Recommendation
– Your info + community info
– Give users/items/ratings; get user-user/item-item
– itemsimilarity
• Classification/Categorization
– Drop into buckets
– Naïve Bayes, Complementary Naïve Bayes, Decision
Forests
• Clustering
– Like classification, but with categories unknown
– K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-
Shift
- 33. SQL Server Live! Orlando 2012
Workflow, Syntax
• Workflow
– Run the job
– Dump the output
– Visualize, predict
• mahout algorithm
--input folderspec
--output folderspec
--param1 value1
--param2 value2
…
• Example:
– mahout itemsimilarity
--input <input-hdfs-path>
--output <output-hdfs-path>
--tempDir <tmp-hdfs-path>
-s SIMILARITY_LOGLIKELIHOOD
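To give a feel for what an item-item similarity job computes, here is a small Python sketch. Note the substitution: it scores pairs with simple Jaccard overlap of the users who touched both items, not the log-likelihood measure named in the command above, and the function name and sample data are assumptions for illustration rather than anything from Mahout.

```python
from itertools import combinations


def item_similarity(interactions):
    """Score item pairs by Jaccard overlap of their users.

    interactions: iterable of (user, item) pairs.
    Returns {(item_a, item_b): score} for pairs sharing at least one user.
    """
    users_by_item = {}
    for user, item in interactions:
        users_by_item.setdefault(item, set()).add(user)

    scores = {}
    for a, b in combinations(sorted(users_by_item), 2):
        shared = users_by_item[a] & users_by_item[b]
        if shared:
            union = users_by_item[a] | users_by_item[b]
            scores[(a, b)] = len(shared) / len(union)
    return scores


if __name__ == "__main__":
    sample = [
        ("u1", "book"), ("u1", "film"),
        ("u2", "book"), ("u2", "film"),
        ("u3", "book"), ("u3", "game"),
    ]
    for pair, score in item_similarity(sample).items():
        print(pair, round(score, 2))
```

The output is a list of scored pairs, which hints at why raw results like these still need a visualization or modeling layer before they are useful to most people.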
The Truth About Mahout
• Mahout is really just an algorithm engine
• Its output is almost unusable by non-
statisticians/non-data scientists
• You need a staff or a product to visualize, or
make into a usable prediction model
• Investigate Predixion Software
– CTO, Jamie MacLennan, used to lead SQL Server Data
Mining team
– Excel add-in can use Mahout remotely, visualize its output,
run predictive analyses
– Also integrates with SQL Server, Greenplum, MapReduce
– http://www.predixionsoftware.com
- 34. SQL Server Live! Orlando 2012
The “Data-Refinery” Idea
• Use Hadoop to “on-board” unstructured
data, then extract manageable subsets
• Load the subsets into conventional DW/BI
servers and use familiar analytics tool to
examine
• This is the current rationalization of
Hadoop + BI tools’ coexistence
• Will it stay this way?
DRILLDOWN ON NOSQL
- 35. SQL Server Live! Orlando 2012
Hitting (Relational) Walls
• CA
– Highly-available consistency
• CP
– Enforced consistency
• AP
– Eventual consistency
The reality…two pivots
Storage Methods
• SQL (RDBMS)
• NoSQL
Storage Locations
• On premises
• Cloud-hosted
- 36. SQL Server Live! Orlando 2012
So many NoSQL options
• More than just the Elephant in the room
• Over 120 types of NoSQL databases
Flavors of NoSQL
- 37. SQL Server Live! Orlando 2012
Graph Database
Use for data with
– a lot of many-to-many relationships
– recursive self-joins
– when your primary objective is quickly
finding connections, patterns and
relationships between the objects within
lots of data
– Examples: Neo4j, Freebase (Google)
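Finding a connection between two objects is the kind of query that needs recursive self-joins in SQL but is a simple traversal over a graph. The Python sketch below uses breadth-first search over an adjacency dictionary; the function name and the sample "follows" data are assumptions for illustration, and a graph database would run such traversals against its own storage and query language.

```python
from collections import deque


def shortest_connection(graph, start, goal):
    """Breadth-first search over an adjacency dict; returns a shortest path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


if __name__ == "__main__":
    follows = {
        "ann": ["bob", "cho"],
        "bob": ["dee"],
        "cho": ["dee"],
        "dee": ["eve"],
    }
    print(shortest_connection(follows, "ann", "eve"))
```

Edges here are directed, so a path from ann to eve does not imply a path back.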
Column Database
• Wide, sparse column sets
• Schema-light
• Examples:
– Cassandra
– HBase
– BigTable
– Google App Engine High Replication Datastore (GAE HR DS)
- 38. SQL Server Live! Orlando 2012
More about Column Databases
• Type A
– Column-families
– Non-relational
– Sparse
– Examples: HBase, Cassandra, xVelocity (SQL 2012
BISM)
• Type B
– Column-stores
– Relational
– Dense
– Example:
SQL Server 2012 Columnstore index
Demo - Document Database (MongoDB)
• Use for data that is
– document-oriented (collection of JSON documents) w/semi-structured data
Encodings include XML, YAML, JSON & BSON
– binary forms (PDF, Microsoft Office documents: Word, Excel…)
• Examples: MongoDB,
CouchDB
- 39. SQL Server Live! Orlando 2012
Demo
MongoDB
Persistent Key / Value Database
• Schema-less
• State - Persistent
• Examples
– AWS DynamoDB
– Azure Tables
– Project Voldemort
- 40. SQL Server Live! Orlando 2012
Volatile Key / Value Database
• Schema-less
• State - Volatile
• Examples
– Redis
– Memcached
Which type of NoSQL for which
type of data?
Type of Data | Type of NoSQL solution | Example
Log files | Wide Column | HBase
Product Catalogs | Key Value on disk | DynamoDB
User profiles | Key Value in memory | Redis
Startups | Document | MongoDB
Social media connections | Graph | Neo4j
LOB w/Transactions | NONE! Use RDBMS | SQL Server
- 41. SQL Server Live! Orlando 2012
What about the cloud?
Cloud-hosted NoSQL up to 50x CHEAPER
- 42. SQL Server Live! Orlando 2012
Consumer Storage Buckets
• Dropbox
• Box
• Windows SkyDrive
• Google Drive
• Amazon Cloud Drive
• Apple iCloud
Developer BLOB Storage Buckets
• Amazon – S3 or Glacier
• Google – Cloud Storage
• Microsoft Azure BLOBS
• Others
- 43. SQL Server Live! Orlando 2012
Cloud-hosted RDBMS
• AWS RDS – SQL
Server, MySQL, Oracle
– Medium cost
– Solid feature set, i.e.
backup, snapshot
– Use existing tooling
• Google – MySQL
– Lowest cost
– Most limited RDBMS
functionality
• Microsoft – Windows
Azure SQL Database
– Highest cost
– Azure VMs w/MySQL
- 44. SQL Server Live! Orlando 2012
Other cloud data services
Hosting public datasets
• Pay to read
• Earn revenue by offering for read
Cleaning / matching (your) data
• ETL – Microsoft Data Explorer, Google Refine
• Data Quality – Windows Azure Marketplace,
InfoChimps, DataMarket.com
Cloud – RDBMS, NoSQL & Hadoop
Service | AWS | Google | Microsoft
Cloud RDBMS | SQL Server, Oracle / MySQL | MySQL | SQL Azure
NoSQL buckets | S3 or Glacier | Cloud Storage | Azure Storage
NoSQL databases | DynamoDB | H/R Datastore on GAE | Azure Tables
Streaming / Machine Learning | Custom EC2 | Prospective Search & Prediction API | StreamInsight & Mahout with Hadoop
Document or Graph | MongoDB on EC2 | Freebase (g) | MongoDB on Windows Azure
Hadoop | Elastic MapReduce using S3 & EC2 | MapR & GCE | Windows Azure HDInsight
Data sets & other | Karmasphere | Translation API, Full-text search | Azure DataMarket
- 45. SQL Server Live! Orlando 2012
Demo
Amazon RDS
Pick your mix and then…
RDBMS
• Host locally
• Host in the Cloud
NoSQL
• Host locally
• Host in the Cloud
Other Services
• Use Cloud Data Markets
• Use Cloud ETL
- 46. SQL Server Live! Orlando 2012
What about me?
Common DBA Tasks in NoSQL
RDBMS | NoSQL
Import Data | Import Data
Set up Security | Set up Security
Perform a Backup | Make a copy of the data
Restore a Database | Move a copy to a location
Create an Index | Create an Index
Join Tables Together | Run MapReduce
Schedule a Job | Schedule a (Cron) Job
Run Database Maintenance | Monitor space and resources used
Send an Email from SQL Server | Set up resource threshold alerts
Search BOL | Interpret Documentation
- 47. SQL Server Live! Orlando 2012
Making Sense – Asking Questions
Data Scientists…
- 48. SQL Server Live! Orlando 2012
Comparing…
Karmasphere Studio for AWS
- 49. SQL Server Live! Orlando 2012
Google BigQuery w/Excel
• Dremel-based service
– For massive amounts of data
– BigQuery currently has quota limits
– SQL-like query language
Demo
Google BigQuery
- 50. SQL Server Live! Orlando 2012
NoSQL To-Do List
Understand CAP & types of NoSQL databases
• Use NoSQL when business needs designate
• Use the right type of NoSQL for your business problem
Try out NoSQL on the cloud
• Quick and cheap for behavioral data
• Mashup cloud datasets
• Good for specialized use cases, e.g. dev, test, training
environments
Learn NoSQL access technologies
• New query languages, e.g. MapReduce, R, Infer.NET
• New query tools (vendor-specific) – Google Refine, Amazon
Karmasphere, Microsoft Excel connectors, etc…
The Changing Data Landscape
Other
Services
RDBMS
NoSQL
- 51. SQL Server Live! Orlando 2012
NoSQL for .NET Developers
• RavenDB
• MongoDB C#/.NET Driver
• MongoDB on Windows Azure
• Couchbase .NET Client Library
• Riak client for .NET
• AWS Toolkit for Visual Studio
• Google cloud APIs (REST-based)