SQL Server Live! Orlando 2012




                                   Big Data and NoSQL
                                   in Microsoft-Land
                                                               Andrew Brust and Lynn Langit
                                                                    Blue Badge Insights & Data Wrangler


                                                                                             Level: Intermediate




                               Meet Andrew
                                    •    CEO and Founder, Blue Badge Insights
                                    •    Big Data blogger for ZDNet
                                    •    Microsoft Regional Director, MVP
                                    •    Co-chair VSLive! and 17 years as a speaker
                                    •    Founder, Microsoft BI User Group of NYC
                                          – http://www.msbinyc.com
                                    •    Co-moderator, NYC .NET Developers Group
                                          – http://www.nycdotnetdev.com
                                    •    “Redmond Review” columnist for
                                         Visual Studio Magazine and Redmond Developer
                                         News
                                    •    brustblog.com, Twitter: @andrewbrust




SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved.   1




                                    Andrew’s New Blog (bit.ly/bigondata)




                               Meet Lynn
                                    •    CEO and Founder, Lynn Langit consulting
                                    •    Former Microsoft Evangelist (4 years)
                                    •    Google Developer Expert
                                    •    MongoDB Master
                                    •    MCT 13 years – 7 certifications
                                    •    Cloudera Certified Developer
                                    •    MSDN Magazine articles
                                          – SQL Azure
                                          – Hadoop on Azure
                                          – MongoDB on Azure
                                    •    www.LynnLangit.com
                                    •    @LynnLangit








                                    Lynn’s
                                    YouTube
                                    Channel




                                      www.TeachingKidsProgramming.org
                                        •   Free Courseware (recipes)
                                        •   Do a Recipe -> Teach a Kid (Ages 10+)
                                        •   Java or Microsoft SmallBasic








                                    Read all about it!




                                    Agenda
                                     •   Overview / Landscape
                                          – Big Data and Hadoop
                                          – NoSQL
                                          – The Big Data-NoSQL Intersection
                                     •   Drilldown on Big Data
                                     •   Drilldown on NoSQL








                                     What is Big Data?
                                     •     100s of TB into PB and higher
                                     •     Data sources: financial data, sensors,
                                           web logs, social media, etc.
                                     •     Parallel processing often involved
                                            – Hadoop is emblematic, but other technologies are Big
                                              Data too
                                     •     Processing of data sets too large for
                                           transactional databases
                                            – Analyzing interactions, rather than transactions
                                            – The three V’s: Volume, Velocity, Variety
                                     •     Big Data tech sometimes imposed on
                                           small data problems




                               Big Data = Exponentially More Data
                                 •   Retail Example -> ‘Feedback Economy’
                                         – Number of transactions
                                         – Number of behaviors (collected every minute)








                               Big Data = ‘Next State’ Questions



                                   Collecting behavioral data:
                                     • What could happen?
                                     • Why didn’t this happen?
                                     • When will the next new thing happen?
                                     • What will the next new thing be?
                                     • What happens?




                                    What’s MapReduce?
                                     •   “Big” input data as key-value pair series
                                     •   Partition the data and send to mappers
                                         (nodes in cluster)
                                     •   Mappers pre-process, put into key-value
                                         format, and send all output for a given (set
                                         of) key(s) to a reducer
                                     •   Reducer aggregates; one output per key,
                                         with value
                                     •   Map and Reduce code natively written as
                                         Java functions
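
The flow described above can be sketched in a few lines of Python. This is a minimal single-process illustration, not Hadoop's Java API: the function names (mapper, shuffle, reducer, map_reduce) and the word-count task are assumptions chosen for the example, and a real cluster would run the mappers and reducers on separate nodes.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) key-value pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all values by key, so each key ends up at exactly one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Aggregate the values for one key into a single output pair."""
    return key, sum(values)

def map_reduce(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())

counts = map_reduce(["big data and nosql", "big data in microsoft land"])
print(counts)
```

Each input line stands in for one partition of the data; the shuffle step is what guarantees that all output for a given key reaches the same reducer.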








                                    MapReduce, in a Diagram


                                    [Diagram: a single input is partitioned across six mappers;
                                     mapper output is grouped by key (K1, K2, K3) and routed to
                                     one reducer per key; the three reducer outputs are merged
                                     into one final output.]




                                    A MapReduce Example


                                                              • Count by suite, on each floor

                                                              • Send per-suite, per platform totals to lobby

                                                              • Sort totals by platform

                                                              • Send two platform packets to 10th, 20th, 30th floor

                                                              • Tally up each platform

                                                              • Collect the tallies

                                                              • Merge tallies into one spreadsheet








                                    What’s a Distributed File System?
                                     •   One where data gets distributed over
                                         commodity drives on commodity servers
                                     •   Data is replicated
                                     •   If one box goes down, no data lost
                                          – “Shared Nothing”
                                     •   BUT: Immutable
                                          – Files can only be written to once
                                          – So updates require drop + re-write (slow)
                                          – You can append though
                                          – Like a DVD/CD-ROM
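
The "data is replicated, so one box going down loses nothing" point can be illustrated with a toy placement model. This is a sketch under stated assumptions, not HDFS's actual placement policy: the round-robin strategy, the node and block names, and both function names are hypothetical; the replication factor of 3 matches the HDFS default.

```python
def place_blocks(blocks, nodes, replicas=3):
    """Assign each block to `replicas` distinct nodes, round-robin."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement

def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }

nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_0", "blk_1", "blk_2", "blk_3", "blk_4"]
placement = place_blocks(blocks, nodes)

# Losing one node leaves every block readable from another replica.
survivors = readable_blocks(placement, failed={"node2"})
print(sorted(survivors))
```

With three replicas, data is lost only when every node holding a given block fails at the same time.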




                                    Hadoop = MapReduce + HDFS
                                     •   Modeled after Google MapReduce + GFS
                                     •   Have more data? Just add more nodes to
                                         cluster.
                                          – Mappers execute in parallel
                                          – Hardware is commodity
                                          – “Scaling out”
                                     •   Use of HDFS means data may well be local
                                         to mapper processing
                                     •   So, not just parallel, but minimal data
                                         movement, which avoids network
                                         bottlenecks
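
The data-locality idea above (run the mapper where the data already lives) can be sketched as a simple assignment routine. This is an illustrative model only: the function name, the least-loaded tie-breaking rule, and the block and node names are assumptions, not Hadoop's real scheduler.

```python
def assign_map_tasks(block_locations, nodes):
    """Give each block's map task to the least-loaded node holding a replica.

    Falls back to any node only when no replica holder is available,
    which is the case that would require moving data over the network.
    """
    load = {node: 0 for node in nodes}
    assignment = {}
    for block, holders in block_locations.items():
        candidates = [node for node in holders if node in load] or list(nodes)
        chosen = min(candidates, key=lambda node: load[node])
        assignment[block] = chosen
        load[chosen] += 1
    return assignment

block_locations = {
    "blk_0": ["node1", "node2"],
    "blk_1": ["node2", "node3"],
    "blk_2": ["node1", "node3"],
}
assignment = assign_map_tasks(block_locations, ["node1", "node2", "node3"])

# True when every map task runs on a node that already stores its block.
all_local = all(assignment[b] in holders for b, holders in block_locations.items())
print(assignment, all_local)
```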








                               Example Comparison: RDBMS vs. Hadoop

                                                      Traditional RDBMS          Hadoop / MapReduce
                                Data Size             Gigabytes (Terabytes)      Petabytes (Exabytes)
                                Access                Interactive and Batch      Batch – NOT Interactive
                                Updates               Read / Write many times    Write once, Read many times
                                Structure             Static Schema              Dynamic Schema
                                Integrity             High (ACID)                Low
                                Scaling               Nonlinear                  Linear
                                Query Response Time   Can be near immediate      Has latency (due to batch processing)




                                     Just-in-time Schema
                                     •      When looking at unstructured data,
                                            schema is imposed at query time
                                     •      Schema is context specific
                                            – If scanning a book, are the values words, lines, or
                                              pages?
                                            – Are notes a single field, or is each word a value?
                                            – Are date and time two fields or one?
                                            – Are street, city, state, zip separate or one value?
                                            – Pig and Hive let you determine this at query time
                                            – So does the Map function in MapReduce code
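
A small sketch of schema-on-read: the same raw lines are interpreted under two different schemas, chosen at query time rather than at load time. The sample records, the "|" delimiter, and the function names are made up for illustration.

```python
# Raw data is stored as-is; no schema is applied when it is written.
raw_lines = [
    "2012-12-10 09:15:00|123 Main St, Orlando, FL, 32801",
    "2012-12-11 17:40:30|9 Elm Ave, New York, NY, 10001",
]

def read_coarse(line):
    """Schema A: one timestamp field and one address field."""
    timestamp, address = line.split("|")
    return {"timestamp": timestamp, "address": address}

def read_fine(line):
    """Schema B: date and time split apart, address broken into four fields."""
    timestamp, address = line.split("|")
    date, time = timestamp.split(" ")
    street, city, state, zip_code = [part.strip() for part in address.split(",")]
    return {
        "date": date,
        "time": time,
        "street": street,
        "city": city,
        "state": state,
        "zip": zip_code,
    }

coarse = [read_coarse(line) for line in raw_lines]
fine = [read_fine(line) for line in raw_lines]
print(coarse[0])
print(fine[0])
```

Neither reading changes the stored data; a different question tomorrow can impose a third schema on the same files.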








                                    What’s HBase?
                                     •   A Wide-Column Store NoSQL database
                                     •   Modeled after Google BigTable
                                     •   Uses HDFS
                                          – Therefore, Hadoop-compatible
                                     •   Hadoop often used with HBase
                                          – But you can use either without the other








                                    NoSQL Confusion
                                     •   Many ‘flavors’ of NoSQL data stores
                                     •   Easiest to group by functionality, but…
                                          – Dividing lines are not clear or consistent
                                     •   NoSQL choice(s) driven by many factors
                                          – Type of data
                                          – Quantity of data
                                          – Knowledge of technical staff
                                          – Product maturity
                                          – Tooling




                                    So much wrong information

                                     •   Everything is ‘new’
                                     •   People are religious about data storage
                                     •   Lots of incorrect information
                                     •   ‘Try’ before you ‘buy’ (or use)
                                     •   Watch out for oversimplification
                                     •   Confusion over vendor offerings








                                    Common NoSQL Misconceptions


                                    Problems
                                     •   Everything is ‘new’
                                     •   People are religious about data storage
                                     •   Open source is always cheaper
                                     •   Cloud is always cheaper
                                     •   Replace RDBMS with NoSQL

                                    Solutions
                                     •   ‘Try’ before you ‘buy’ (or use)
                                     •   Leverage NoSQL communities
                                     •   Add NoSQL to existing RDBMS solution




                                    NoSQL + Big Data
                                     •   HBase and Cassandra work with Hadoop, are
                                         NoSQL databases
                                     •   MongoDB brands itself a Big Data technology
                                     •   Couchbase does too
                                     •   Just-in-time schema
                                     •   MapReduce in MongoDB, others
                                     •   Hadoop and most NoSQL DBs are
                                         partitioned, scale-out technologies
                                     •   It’s all about analytics on semi-structured
                                         or unstructured data








                                   DRILLDOWN ON BIG DATA




                                    The Hadoop Stack
                                       Log file integration



                                       Machine Learning/Data Mining

                                       RDBMS Import/Export

                                       Query: HiveQL and Pig Latin

                                       Database

                                       MapReduce, HDFS








                                    What’s Hive?
                                     •   Began as Hadoop sub-project
                                          – Now top-level Apache project
                                     •   Provides a SQL-like (“HiveQL”)
                                         abstraction over MapReduce
                                     •   Has its own HDFS table file format (and
                                         it’s fully schema-bound)
                                     •   Can also work over HBase
                                     •   Acts as a bridge to many BI products
                                         which expect tabular data




                                    Hadoop Distributions
                                     •   Cloudera
                                     •   Hortonworks
                                          – HCatalog: Hive/Pig/MR Interop
                                     •   MapR
                                          – Network File System replaces HDFS
                                     •   IBM InfoSphere BigInsights
                                          – HDFS<->DB2 integration
                                     •   And now Microsoft…








                                    Microsoft HDInsight
                                     •   Developed with Hortonworks and
                                         incorporates Hortonworks Data Platform
                                         (HDP) for Windows
                                     •   Windows Azure HDInsight and Microsoft
                                         HDInsight (for Windows Server)
                                          – Single node preview runs on Windows client
                                     •   Includes ODBC Driver for Hive
                                          – And Excel Add-In that uses it
                                     •   JavaScript MapReduce framework
                                     •   Contribute it all back to open source
                                         Apache Project




                                    Amenities for
                                    Visual Studio/.NET

                                     [Diagram: Hortonworks Data Platform for Windows at the center,
                                      surrounded by:]
                                     •   MRLib (NuGet Package)
                                     •   MR code in C#: HadoopJob, MapperBase, ReducerBase
                                     •   LINQ to Hive
                                     •   OdbcClient + Hive ODBC Driver
                                     •   Debugging
                                     •   Deployment








                                    Some ways to work
                                     •   Microsoft HDInsight
                                          – Cloud: go to www.hadooponazure.com, request invite
                                          – Local: Download Microsoft HDInsight
                                               Runs on just about anything, including Windows XP
                                               Get it via the Web Platform installer (WebPI)
                                          – Both are free for now; Azure HDInsight will be fee-based when
                                            RTM
                                     •   Amazon Web Services Elastic MapReduce
                                          – Create AWS account
                                          – Select Elastic MapReduce in Dashboard
                                          – Cheap for experimenting, but not free
                                     •   Cloudera CDH VM image
                                          –   Download as .tar.gz file
                                          –   “Un-tar” (can use WinRAR, 7zip)
                                          –   Run via VMWare Player or Virtual Box
                                          –   Everything’s free




                                    Some ways to work




                                              HDInsight               EMR                 CDH 4








                                    Microsoft HDInsight
                                     •   Much simpler than the others
                                     •   Browser-based portal
                                          – Launch MapReduce jobs
                                          – Azure: provision the cluster, manage ports, gather external
                                            data
                                     •   Interactive JavaScript & Hive console
                                          – JS: HDFS, Pig, light data visualization
                                          – Hive commands and metadata discovery
                                          – New console coming
                                     •   Desktop Shortcuts:
                                          – Command window, MapReduce, Name Node status in
                                            browser
                                          – Azure: from portal page you can RDP directly to Hadoop
                                            head node for these desktop shortcuts




                                     Windows Azure
                                     HDInsight








                                    Amazon Elastic MapReduce
                                     •   Lots of steps!
                                     •   At a high level:
                                          – Set up AWS account and S3 “buckets”
                                          – Generate Key Pair and PEM file
                                          – Install Ruby and EMR Command Line Interface
                                          – Provision the cluster using CLI
                                              A batch file can work very well here
                                          – Set up and run SSH/PuTTY
                                          – Work interactively at command line




                                    Amazon EMR – Prep Steps
                                     •   Create an AWS account
                                     •   Create an S3 bucket for log storage
                                          – with list permissions for authenticated users
                                     •   Create a Key Pair and save PEM file
                                     •   Install Ruby
                                     •   Install Amazon Web Services Elastic
                                         MapReduce Command Line Interface
                                          – aka AWS EMR CLI 
                                     •   Create credentials.json in EMR CLI folder
                                          – Associate with same region as where key pair created








                                    Amazon – Security and Startup
                                     •   Security
                                          –   Download PuTTYgen and run it
                                          –   Click Load and browse to PEM file
                                          –   Save it in PPK format
                                          –   Exit PuTTYgen
                                     •   In a command window, navigate to EMR CLI
                                         folder and enter command:
                                          – ruby elastic-mapreduce --create --alive [--num-instances xx]
                                            [--pig-interactive] [--hive-interactive] [--hbase --instance-type
                                            m1.large]
                                     •   In AWS Console, go to EC2 Dashboard and
                                         click Instances on left nav bar
                                     •   Wait until instance is running and get its
                                         Public DNS name
                                          – Use Compatibility View in IE or copy may not work




                                    Connect!
                                     •   Download and run PuTTY
                                     •   Paste DNS name of EC2 instance into hostname
                                         field
                                     •   In Treeview, drill down and navigate to
                                         Connection > SSH > Auth, browse to PPK file
                                     •   Once EC2 instance(s) running, click Open
                                     •   Click Yes to “The server’s host key is not cached
                                         in the registry…” PuTTY Security Alert
                                     •   When prompted for user name, type “hadoop” and
                                         hit Enter
                                     •   cd bin, then run hive, pig, or hbase shell
                                     •   Right-click to paste from clipboard; option to go
                                         full-screen
                                     •   (Kill EC2 instance(s) from Dashboard when done)








                                     Amazon Elastic MapReduce




                                    Cloudera CDH4 Virtual Machine
                                     •   Get it for free, in VMWare and Virtual Box
                                         versions.
                                          – VMWare player and Virtual Box are free too
                                     •   Run it, and configure it to have its own IP on
                                         your network. Use ifconfig to discover IP.
                                     •   Assuming IP of 192.168.1.59, open browser on
                                         your own (host) machine and navigate to:
                                          – http://192.168.1.59:8888
                                     •   Can also use browser in VM and hit:
                                          – http://localhost:8888
                                     •   Work in “Hue”…








                                    Hue
                                     •   Browser based UI,
                                         with front ends
                                         for:
                                          – HDFS (w/ upload &
                                            download)
                                          – MapReduce job
                                            creation and
                                            monitoring
                                          – Hive (“Beeswax”)
                                     •   And in-browser
                                         command line
                                         shells for:
                                          – HBase
                                          – Pig (“Grunt”)




                                    Impala: What it Is
                                     •   Distributed SQL query engine over
                                         Hadoop cluster
                                     •   Announced at Strata/Hadoop World in NYC
                                         on October 24th
                                     •   In Beta, as part of CDH 4.1
                                     •   Works with HDFS and Hive data
                                     •   Compatible with HiveQL and Hive drivers
                                          – Query with Beeswax








                                    Impala: What it’s Not
                                     •   Impala is not Hive
                                          – Hive converts HiveQL to Java MapReduce code and
                                            executes it in batch mode
                                          – Impala executes query interactively over the data
                                          – Brings BI tools and Hadoop closer together
                                     •   Impala is not an Apache Software
                                         Foundation project
                                          – It is open source and Apache-licensed, but
                                            still incubated by Cloudera
                                          – Only in CDH




                                     Cloudera CDH4








                                    Hadoop commands
                                     •   HDFS
                                          – hadoop fs -<filecommand>
                                          – Create and remove directories:
                                              -mkdir, -rm, -rmr
                                          – Upload and download files to/from HDFS
                                              -put, -get
                                          – View directory contents
                                              -ls, -lsr
                                          – Copy, move, view files
                                              -cp, -mv, -cat
                                     •   MapReduce
                                          – Run a Java jar-file based job
                                              hadoop jar jarname params




                                     Hadoop (directly)








                                    HBase
                                     •   Concepts:
                                          – Tables, column families
                                          – Columns, rows
                                          – Keys, values
                                     •   Commands:
                                          –   Definition: create, alter, drop, truncate
                                          –   Manipulation: get, put, delete, deleteall, scan
                                          –   Discovery: list, exists, describe, count
                                          –   Enablement: disable, enable
                                          –   Utilities: version, status, shutdown, exit
                                          –   Reference: http://wiki.apache.org/hadoop/Hbase/Shell
                                     •   Beyond the shell:
                                          – Interesting HBase work can also be done in MapReduce and Pig




                                    HBase Examples
                                     •   create 't1', 'f1', 'f2', 'f3'
                                     •   describe 't1'
                                     •   alter 't1', {NAME => 'f1',
                                         VERSIONS => 5}
                                     •   put 't1', 'r1', 'f1:c1', 'value'
                                     •   get 't1', 'r1'
                                     •   count 't1'
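
The data model behind these shell commands can be pictured as nested maps: row key, then 'family:qualifier' column, then a short list of versions. The following is a toy in-memory sketch in Python, not the HBase client API; the class name ToyTable, its methods, the column name f1:c1 and the default of three kept versions are all assumptions made for illustration.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of an HBase table: row key -> 'family:qualifier' -> versions."""

    def __init__(self, name, *families, max_versions=3):
        self.name = name
        self.families = set(families)      # column families are fixed up front
        self.max_versions = max_versions   # illustrative default, not read from HBase
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value):
        # Columns are addressed as 'family:qualifier'; the family must already exist.
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        versions = self.rows[row][column]
        versions.insert(0, value)          # newest value first
        del versions[self.max_versions:]   # drop versions beyond the limit

    def get(self, row):
        # Return the newest value of every column stored for this row.
        return {col: vals[0] for col, vals in self.rows.get(row, {}).items()}

    def count(self):
        # Number of distinct row keys.
        return len(self.rows)


t1 = ToyTable("t1", "f1", "f2", "f3")
t1.put("r1", "f1:c1", "value")
print(t1.get("r1"))   # {'f1:c1': 'value'}
print(t1.count())     # 1
```

Qualifiers (c1 here) are created on the fly by put, while families are declared when the table is created, which is why a put against an undeclared family fails in this sketch.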








                                     HBase




                                    Submitting, Running and
                                    Monitoring Jobs
                                     •   Upload a JAR
                                     •   Use Streaming
                                          – Use other languages (i.e. other than Java) to write
                                            MapReduce code
                                          – Python is a popular option
                                          – Any executable works, even C# console apps
                                          – On MS HDInsight, JavaScript works too
                                          – Still uses a JAR file: streaming.jar
                                     •   Run at command line (passing JAR name
                                         and params) or use GUI
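
As a rough illustration of the Streaming model, here is a minimal word-count sketch in Python. Streaming passes lines to the mapper on stdin, sorts the mapper's tab-separated key/value output by key, and feeds it to the reducer on stdin. The file name wordcount.py and the map/reduce argument convention are assumptions made for this example.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Only touch stdin when explicitly started as "wordcount.py map" or "wordcount.py reduce".
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Such a script would typically be submitted with hadoop jar, passing the streaming JAR plus -input, -output, -mapper, -reducer and -file arguments; the exact JAR path depends on the distribution.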








                                     Running MapReduce
                                     Jobs




                                    Hive
                                     •   Used by most BI products which connect
                                         to Hadoop
                                     •   Provides a SQL-like abstraction over
                                         Hadoop
                                          – Officially HiveQL, or HQL
                                     •   Works on own tables, but also on HBase
                                     •   Query generates MapReduce job, output of
                                         which becomes result set
                                     •   Microsoft has a Hive ODBC driver
                                          – Connects Excel, Reporting Services, PowerPivot,
                                            Analysis Services Tabular Mode (only)








                                    Hive, Continued
                                     •   Load data from flat HDFS files
                                          – LOAD DATA [LOCAL] INPATH 'myfile'
                                            INTO TABLE mytable;
                                     •   SQL Queries
                                          – CREATE, ALTER, DROP
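                                          – INSERT OVERWRITE (replaces the full contents of a
                                            table or partition)
                                          – SELECT, JOIN, WHERE, GROUP BY
                                          – SORT BY, but ordering data is tricky! SORT BY orders
                                            rows only within each reducer; ORDER BY gives a
                                            total order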
                                          – MAP/REDUCE/TRANSFORM…USING allows for custom
                                            map, reduce steps utilizing Java or streaming code
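
Why ordering is tricky can be shown with a small simulation. This is a Python sketch, not Hive code: the round-robin split below is only a stand-in for however Hive distributes rows to reducers, and the function names are made up for the example.

```python
def sort_by(rows, num_reducers):
    """Mimic SORT BY: rows are spread across reducers and each sorts only its own share."""
    buckets = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        buckets[i % num_reducers].append(row)   # stand-in for the real row distribution
    # Concatenate the per-reducer outputs, as if reading the result files in order.
    return [row for bucket in buckets for row in sorted(bucket)]


def order_by(rows):
    """Mimic ORDER BY: one reducer sees every row, so the order is total."""
    return sorted(rows)


rows = [5, 1, 4, 2, 3, 6]
print(sort_by(rows, 2))   # [3, 4, 5, 1, 2, 6]
print(order_by(rows))     # [1, 2, 3, 4, 5, 6]
```

Each reducer's output is sorted, but the combined result is not, which is the behavior to expect from SORT BY whenever more than one reducer runs.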




                              Excel Add-In for Hive








                                     Hive




                                    Pig
                                     •   Instead of SQL, employs a language (“Pig
                                         Latin”) that accommodates data flow
                                         expressions
                                          – Do a combo of Query and ETL
                                     •   “10 lines of Pig Latin ≈ 200 lines of Java.”
                                     •   Works with structured or unstructured data
                                     •   Operations
                                          – As with Hive, a MapReduce job is generated
                                          – Unlike Hive, output is only a flat file in HDFS or text at
                                            the command-line console
                                          – With MS Hadoop, can easily convert to JavaScript array,
                                            then manipulate
                                     •   Use command line (“Grunt”) or build scripts








                                    Example
                                     •   A = LOAD 'myfile'
                                           AS (x, y, z);
                                         B = FILTER A BY x > 0;
                                         C = GROUP B BY x;
                                         D = FOREACH C GENERATE
                                           group, COUNT(B);
                                         STORE D INTO 'output';
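
The same load, filter, group and count data flow can be traced step by step in plain Python. This sketch only mirrors the logic on a handful of made-up rows; the sample data is hypothetical, and the real script would run as a MapReduce job over a file.

```python
from collections import defaultdict

# Load step: made-up in-memory rows with fields (x, y, z).
A = [(1, "a", 10), (1, "b", 20), (0, "c", 30), (2, "d", 40), (-3, "e", 50)]

# Filter step: keep only rows where x > 0.
B = [row for row in A if row[0] > 0]

# Group step: one bag of rows per distinct value of x.
C = defaultdict(list)
for row in B:
    C[row[0]].append(row)

# Foreach step: emit the group key and the number of rows in its bag.
D = sorted((group, len(bag)) for group, bag in C.items())

print(D)  # [(1, 2), (2, 1)]
```

Note that the counting step iterates over the grouped relation, since that is where the per-key bags live.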




                                    Pig Latin Examples
                                     •   Imperative, file system commands
                                          – LOAD, STORE
                                              Schema specified on LOAD
                                     •   Declarative, query commands (SQL-like)
                                          – xxx = file or data set
                                          – FOREACH xxx GENERATE (SELECT…FROM xxx)
                                          – JOIN (WHERE/INNER JOIN)
                                          – FILTER xxx BY (WHERE)
                                          – ORDER xxx BY (ORDER BY)
                                          – GROUP xxx BY / GENERATE COUNT(xxx)
                                            (SELECT COUNT(*) GROUP BY)
                                          – DISTINCT (SELECT DISTINCT)
                                     •   Syntax is assignment statement-based:
                                          – MyCusts = FILTER Custs BY SalesPerson == 15;
                                     •   Access HBase
                                          – CpuMetrics = LOAD 'hbase://SystemMetrics' USING
                                            org.apache.pig.backend.hadoop.hbase.HBaseStorage(
                                            'cpu:','-loadKey -returnTuple');








                                     Pig




                                    Sqoop
                                     sqoop import
                                      --connect
                                       "jdbc:sqlserver://<servername>.
                                        database.windows.net:1433;
                                        database=<dbname>;
                                        user=<username>@<servername>;
                                        password=<password>"
                                      --table <from_table>
                                      --target-dir <to_hdfs_folder>
                                      --split-by <from_table_column>








                                    Sqoop
                                     sqoop export
                                      --connect
                                       "jdbc:sqlserver://<servername>.
                                        database.windows.net:1433;
                                        database=<dbname>;
                                        user=<username>@<servername>;
                                        password=<password>"
                                      --table <to_table>
                                      --export-dir <from_hdfs_folder>
                                      --input-fields-terminated-by
                                       "<delimiter>"




                                    Flume NG
                                     •   Sources
                                          – Avro (data serialization system – can read
                                            JSON-encoded data files, and can work over RPC)
                                          – Exec (reads from stdout of long-running process)
                                     •   Sinks
                                          – HDFS, HBase, Avro
                                     •   Channels
                                          – Memory, JDBC, file








                                    Flume NG (next generation)
                                     •     Set up conf/flume.conf
                                     # Define a memory channel called ch1 on agent1
                                     agent1.channels.ch1.type = memory

                                     # Define an Avro source called avro-source1 on agent1 and tell it
                                     # to bind to 0.0.0.0:41414. Connect it to channel ch1.
                                     agent1.sources.avro-source1.channels = ch1
                                     agent1.sources.avro-source1.type = avro
                                     agent1.sources.avro-source1.bind = 0.0.0.0
                                     agent1.sources.avro-source1.port = 41414

                                     # Define a logger sink that simply logs all events it receives
                                     # and connect it to the other end of the same channel.
                                     agent1.sinks.log-sink1.channel = ch1
                                     agent1.sinks.log-sink1.type = logger

                                     # Finally, now that we've defined all of our components, tell
                                     # agent1 which ones we want to activate.
                                     agent1.channels = ch1
                                     agent1.sources = avro-source1
                                     agent1.sinks = log-sink1



                                     •     From the command line:
                                     flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1




                                    Mahout Algorithms
                                     •     Recommendation
                                            – Your info + community info
                                            – Give users/items/ratings; get user-user/item-item
                                            – itemsimilarity
                                     •     Classification/Categorization
                                            – Drop into buckets
                                            – Naïve Bayes, Complementary Naïve Bayes, Decision
                                              Forests
                                     •     Clustering
                                            – Like classification, but with categories unknown
                                            – K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-
                                              Shift








                                    Workflow, Syntax
                                     •   Workflow
                                          – Run the job
                                          – Dump the output
                                          – Visualize, predict
                                     •   mahout algorithm
                                           --input folderspec
                                           --output folderspec
                                           --param1 value1
                                           --param2 value2
                                         …
                                     •   Example:
                                          – mahout itemsimilarity
                                              --input <input-hdfs-path>
                                              --output <output-hdfs-path>
                                              --tempDir <tmp-hdfs-path>
                                              -s SIMILARITY_LOGLIKELIHOOD
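
To give a feel for what a log-likelihood item similarity measures, here is a small Python sketch using the standard G² statistic on a 2x2 co-occurrence table. It is not Mahout's implementation, and the function names and sample preferences are assumptions made for illustration.

```python
from itertools import combinations
from math import log


def g2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of counts."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    score = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / total
                score += o * log(o / expected)
    return 2.0 * score


def item_similarity(prefs):
    """prefs maps user -> set of items; returns {(item_a, item_b): G^2 score}."""
    items = sorted(set().union(*prefs.values()))
    scores = {}
    for a, b in combinations(items, 2):
        both = sum(1 for s in prefs.values() if a in s and b in s)
        only_a = sum(1 for s in prefs.values() if a in s and b not in s)
        only_b = sum(1 for s in prefs.values() if b in s and a not in s)
        neither = len(prefs) - both - only_a - only_b
        scores[(a, b)] = g2(both, only_a, only_b, neither)
    return scores


# Hypothetical preferences: which items each user interacted with.
prefs = {"u1": {"A", "B"}, "u2": {"A", "B"}, "u3": {"A", "C"}, "u4": {"C"}, "u5": {"D"}}
for pair, score in sorted(item_similarity(prefs).items()):
    print(pair, round(score, 3))
```

A score near zero means the two items co-occur about as often as chance would predict; larger scores mean the co-occurrence pattern is further from independence.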




                                    The Truth About Mahout
                                     •   Mahout is really just an algorithm engine
                                     •   Its output is almost unusable by non-
                                         statisticians/non-data scientists
                                     •   You need staff or a product to visualize the output, or
                                         to turn it into a usable prediction model
                                     •   Investigate Predixion Software
                                          – CTO, Jamie MacLennan, used to lead SQL Server Data
                                            Mining team
                                          – Excel add-in can use Mahout remotely, visualize its output,
                                            run predictive analyses
                                          – Also integrates with SQL Server, Greenplum, MapReduce
                                          – http://www.predixionsoftware.com








                                    The “Data-Refinery” Idea
                                     •   Use Hadoop to “on-board” unstructured
                                         data, then extract manageable subsets
                                     •   Load the subsets into conventional DW/BI
                                         servers and use familiar analytics tools to
                                         examine them
                                     •   This is the current rationalization of
                                         Hadoop + BI tools’ coexistence
                                     •   Will it stay this way?




                                   DRILLDOWN ON NOSQL








                             Hitting (Relational) Walls

                            •   CA
                                  – Highly-available consistency
                            •   CP
                                  – Enforced consistency
                            •   AP
                                  – Eventual consistency




                                The reality…two pivots




                                  Storage                                      Storage
                                  Methods                                      Locations
                                  • SQL (RDBMS)                                • On premises
                                  • NoSQL                                      • Cloud-hosted








                                  So many NoSQL options
                                     •   More than just the Elephant in the room
                                     •   More than 120 NoSQL databases




                               Flavors of NoSQL








                                Graph Database
                               Use for data with
                                     – a lot of many-to-many relationships
                                     – recursive self-joins
                                     – when your primary objective is quickly
                                       finding connections, patterns and
                                       relationships between the objects within
                                       lots of data
                                     – Examples: Neo4j, Freebase (Google)




                                Column Database

                                •   Wide, sparse column sets
                                •   Schema-light
                                •   Examples:
                                      – Cassandra
                                      – HBase
                                      – BigTable
                                      – Google App Engine High Replication Datastore (GAE HR DS)








                                More about Column Databases

                                     •     Type A
                                            – Column-families
                                            – Non-relational
                                            – Sparse
                                            – Examples: HBase, Cassandra, xVelocity (SQL 2012
                                              BISM)
                                     •     Type B
                                            – Column-stores
                                            – Relational
                                            – Dense
                                            – Example:
                                                SQL Server 2012 Columnstore index
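
The sparse versus dense distinction can be pictured with two small Python structures. This is only an illustrative sketch: the row keys, column names and values are invented, and neither structure reflects how HBase or SQL Server stores data on disk.

```python
# Type A, column-family style: each row stores only the columns it actually has (sparse).
column_family = {
    "row1": {"cf:name": "Ann", "cf:city": "NYC"},
    "row2": {"cf:name": "Bob"},   # no city stored at all for this row
}

# Type B, column-store style: every column holds exactly one value per row (dense).
column_store = {
    "name": ["Ann", "Bob"],
    "sales": [120, 80],
}

# A column store can answer an aggregate by scanning a single column.
total_sales = sum(column_store["sales"])
print(total_sales)                           # 200
print("cf:city" in column_family["row2"])    # False
```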




                                Demo - Document Database (MongoDB)
                                •   Use for data that is
                                         – document-oriented (collection of
                                           JSON documents) with semi-structured
                                           data
                                           Encodings include XML, YAML, JSON
                                           & BSON
                                         – in binary form
                                           (PDF, Microsoft Office documents such as
                                           Word, Excel…)

                                •   Examples: MongoDB,
                                    CouchDB








                                    Demo
                                     MongoDB




                                Persistent Key / Value Database
                                •    Schema-less
                                •    State - Persistent
                                •    Examples
                                      – AWS DynamoDB
                                      – Azure Tables
                                      – Project Voldemort








                                Volatile Key / Value Database
                                •    Schema-less
                                •    State - Volatile
                                •    Examples
                                      – Redis
                                      – Memcached




                                    Which type of NoSQL for which
                                    type of data?

                                    Type of Data               Type of NoSQL solution     Example
                                    Log files                  Wide Column                HBase
                                    Product Catalogs           Key Value on disk          DynamoDB
                                    User profiles              Key Value in memory        Redis
                                    Startups                   Document                   MongoDB
                                    Social media connections   Graph                      Neo4j
                                    LOB w/Transactions         NONE! Use RDBMS            SQL Server








                                    What about the cloud?




                              Cloud-hosted NoSQL up to 50x CHEAPER








                                 Consumer Storage Buckets
                             •   Dropbox
                             •   Box
                             •   Windows SkyDrive
                             •   Google Drive
                             •   Amazon Cloud Drive
                             •   Apple iCloud




                                 Developer BLOB Storage Buckets
                             •   Amazon – S3 or Glacier
                             •   Google – Cloud Storage
                             •   Microsoft – Azure BLOBs
                             •   Others








                             Cloud-hosted RDBMS
                             •    AWS RDS – SQL
                                  Server, MySQL, Oracle
                                   – Medium cost
                                   – Solid feature set, e.g.
                                     backup, snapshot
                                   – Use existing tooling
                             •    Google – MySQL
                                   – Lowest cost
                                   – Most limited RDBMS
                                     functionality
                             •    Microsoft – Windows
                                  Azure SQL Database
                                   – Highest cost
                                   – Azure VMs w/MySQL








                              Other cloud data services

                                 Hosting public datasets

                                  • Pay to read
                                  • Earn revenue by offering for read

                                 Cleaning / matching (your) data

                                  • ETL – Microsoft Data Explorer, Google Refine
                                  • Data Quality – Windows Azure Marketplace,
                                    InfoChimps, DataMarket.com




                                Cloud – RDBMS, NoSQL & Hadoop
                                                          AWS                              Google                                Microsoft
                              Cloud RDBMS                 SQL Server, Oracle / MySQL       MySQL                                 SQL Azure
                              NoSQL buckets               S3 or Glacier                    Cloud Storage                         Azure Storage
                              NoSQL databases             DynamoDB                         H/R Datastore on GAE                  Azure Tables
                              Streaming, Machine Learning Custom EC2                       Prospective Search & Prediction API   StreamInsight & Mahout with Hadoop
                              Document or Graph           MongoDB on EC2                   Freebase (g)                          MongoDB on Windows Azure
                              Hadoop                      Elastic MapReduce using S3 & EC2 MapR & GCE                            Windows Azure HDInsight
                              Data sets & other           Karmasphere                      Translation API, Full-text search     Azure DataMarket








                                   Demo
                                     Amazon RDS




                                Pick your mix and then…

                                 RDBMS
                                  • Host locally
                                  • Host in the Cloud

                                 NoSQL
                                  • Host locally
                                  • Host in the Cloud

                                 Other Services
                                  • Use Cloud Data Markets
                                  • Use Cloud ETL








                                    What about me?




                                Common DBA Tasks in NoSQL
                              RDBMS                                NoSQL
                              Import Data                          Import Data
                              Setup Security                       Setup Security
                              Perform a Backup                     Make a copy of the data
                              Restore a Database                   Move a copy to a location
                              Create an Index                      Create an Index
                              Join Tables Together                 Run MapReduce
                              Schedule a Job                       Schedule a (Cron) Job
                              Run Database Maintenance             Monitor space and resources used
                              Send an Email from SQL Server        Set up resource threshold alerts

                              Search BOL                           Interpret Documentation








                                Making Sense – Asking Questions




                                    Data Scientists…








                             Comparing…




                                          Karmasphere Studio for AWS








                                Google BigQuery w/Excel
                                •   Dremel-based service
                                      – For massive amounts of data
                                      – BigQuery currently has quota limits
                                      – SQL-like query language




                                    Demo
                                     Google BigQuery








                               NoSQL To-Do List

                                 Understand CAP & types of NoSQL databases
                                  • Use NoSQL when business needs dictate
                                  • Use the right type of NoSQL for your business problem
                                 Try out NoSQL on the cloud
                                  • Quick and cheap for behavioral data
                                  • Mashup cloud datasets
                                  • Good for specialized use cases, e.g. dev, test, training
                                    environments
                                 Learn NoSQL access technologies
                                  • New query languages, e.g. MapReduce, R, Infer.NET
                                  • New query tools (vendor-specific) – Google Refine, Karmasphere
                                    Studio for AWS, Microsoft Excel connectors, etc.
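
For the MapReduce item above, a minimal single-process sketch may help show the map, shuffle, and reduce steps. It is a word count in plain Python, assuming no Hadoop cluster; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and do not come from any Hadoop API.

```python
from collections import defaultdict


def map_phase(records):
    """Mapper: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: aggregate the grouped values into one output per key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Run the three steps in sequence on an in-memory list of text records."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    sample = ["NoSQL and Big Data", "Big Data and Hadoop"]
    print(word_count(sample))
```

In a real cluster the mappers and reducers run in parallel on different nodes and the shuffle moves data between them; here everything runs in one process purely to show the data flow.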




                                The Changing Data Landscape


                                [Diagram: RDBMS, NoSQL, Other Services]








                                    NoSQL for .NET Developers
                                     •   RavenDB
                                     •   MongoDB C#/.NET Driver
                                     •   MongoDB on Windows Azure
                                     •   Couchbase .NET Client Library
                                     •   Riak client for .NET
                                     •   AWS Toolkit for Visual Studio
                                     •   Google cloud APIs (REST-based)




Big Data and NoSQL in Microsoft-Land

  • 1. SQL Server Live! Orlando 2012 Big Data and NoSQL in Microsoft-Land Andrew Brust and Lynn Langit Blue Badge Insights & Data Wrangler Level: Intermediate Meet Andrew • CEO and Founder, Blue Badge Insights • Big Data blogger for ZDNet • Microsoft Regional Director, MVP • Co-chair VSLive! and 17 years as a speaker • Founder, Microsoft BI User Group of NYC – http://www.msbinyc.com • Co-moderator, NYC .NET Developers Group – http://www.nycdotnetdev.com • “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News • brustblog.com, Twitter: @andrewbrust SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 1
  • 2. SQL Server Live! Orlando 2012 Andrew’s New Blog (bit.ly/bigondata) Meet Lynn • CEO and Founder, Lynn Langit consulting • Former Microsoft Evangelist (4 years) • Google Developer Expert • MongoDB Master • MCT 13 years – 7 certifications • Cloudera Certified Developer • MSDN Magazine articles – SQL Azure – Hadoop on Azure – MongoDB on Azure • www.LynnLangit.com • @LynnLangit SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 2
  • 3. SQL Server Live! Orlando 2012 Lynn’s YouTube Channel • recipes) www.TeachingKidsProgramming.org • Free Courseware ( • Do a Recipe  Teach a Kid (Ages 10 ++) • Java or Microsoft SmallBasic  SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 3
  • 4. SQL Server Live! Orlando 2012 Read all about it! Agenda • Overview / Landscape – Big Data, and Hadoop – NoSQL – The Big Data-NoSQL Intersection • Drilldown on Big Data • Drilldown on NoSQL SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 4
  • 5. SQL Server Live! Orlando 2012 What is Big Data? • 100s of TB into PB and higher • Involving data from: financial data, sensors, web logs, social media, etc. • Parallel processing often involved – Hadoop is emblematic, but other technologies are Big Data too • Processing of data sets too large for transactional databases – Analyzing interactions, rather than transactions – The three V’s: Volume, Velocity, Variety • Big Data tech sometimes imposed on small data problems BigData = Exponentially More Data • Retail Example -> ‘Feedback Economy’ – Number of transactions – Number of behaviors (collected every minute) SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 5
  • 6. SQL Server Live! Orlando 2012 BigData = ‘Next State’ Questions • What could happen? • Why didn’t this happen? Collecting • When will the next new thing Behavioral happen? data • What will the next new thing be? • What happens? What’s MapReduce? • “Big” input data as key-value pair series • Partition the data and send to mappers (nodes in cluster) • Mappers pre-process, put into key-value format, and send all output for a given (set of) key(s) to a reducer • Reducer aggregates; one output per key, with value • Map and Reduce code natively written as Java functions SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 6
  • 7. SQL Server Live! Orlando 2012 MapReduce, in a Diagram Input mapper Output K1 Input mapper Output Input reducer Output Output K2 Input mapper Output Input reducer Output Input K3 Input mapper Output Input reducer Output Input mapper Output Input mapper Output A MapReduce Example • Count by suite, on each floor • Send per-suite, per platform totals to lobby • Sort totals by platform • Send two platform packets to 10th, 20th, 30th floor • Tally up each platform • Collect the tallies • Merge tallies into one spreadsheet SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 7
  • 8. SQL Server Live! Orlando 2012 What’s a Distributed File System? • One where data gets distributed over commodity drives on commodity servers • Data is replicated • If one box goes down, no data lost – “Shared Nothing” • BUT: Immutable – Files can only be written to once – So updates require drop + re-write (slow) – You can append though – Like a DVD/CD-ROM Hadoop = MapReduce + HDFS • Modeled after Google MapReduce + GFS • Have more data? Just add more nodes to cluster. – Mappers execute in parallel – Hardware is commodity – “Scaling out” • Use of HDFS means data may well be local to mapper processing • So, not just parallel, but minimal data movement, which avoids network bottlenecks SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 8
  • 9. SQL Server Live! Orlando 2012 Example Comparison: RDBMS vs. Hadoop Traditional RDBMS Hadoop / MapReduce Data Size Gigabytes (Terabytes) Petabytes (Hexabytes) Access Interactive and Batch Batch – NOT Interactive Updates Read / Write many times Write once, Read many times Structure Static Schema Dynamic Schema Integrity High (ACID) Low Scaling Nonlinear Linear Query Response Can be near immediate Has latency (due to batch processing) Time Just-in-time Schema • When looking at unstructured data, schema is imposed at query time • Schema is context specific – If scanning a book, are the values words, lines, or pages? – Are notes a single field, or is each word value? – Are date and time two fields or one? – Are street, city, state, zip separate or one value? – Pig and Hive let you determine this at query time – So does the Map function in MapReduce code SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 9
  • 10. SQL Server Live! Orlando 2012 What’s HBase? • A Wide-Column Store NoSQL database • Modeled after Google BigTable • Uses HDFS – Therefore, Hadoop-compatible • Hadoop often used with HBase – But you can use either without the other SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 10
  • 11. SQL Server Live! Orlando 2012 NoSQL Confusion • Many ‘flavors’ of NoSQL data stores • Easiest to group by functionality, but… – Dividing lines are not clear or consistent • NoSQL choice(s) driven by many factors – Type of data – Quantity of tool – Knowledge of technical staff – Product maturity – Tooling So much wrong information People are Everything is religious about ‘new’ data storage Lots of ‘Try’ before incorrect you ‘buy’ (or information use) Watch out for Confusion over over vendor simplification offerings SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 11
  • 12. SQL Server Live! Orlando 2012 Common NoSQL Misconceptions Problems Solutions Everything is ‘new’ People are religious about ‘Try’ before you ‘buy’ (or use) data storage Leverage NoSQL Open source is always communities cheaper Add NoSQL to existing Cloud is always cheaper RDBMS solution Replace RDBMS with NoSQL NoSQL + Big Data • HBase and Cassandra work with Hadoop, are NoSQL databases • MongoDB brands itself a Big Data technology • Couchbase does too • Just-in-time schema • MapReduce in MongoDB, others • Hadoop and most NoSQL DBs are partitioned, scale-out technologies • It’s all about analytics on semi- or un- structured data SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 12
  • 13. SQL Server Live! Orlando 2012 DRILLDOWN ON BIG DATA The Hadoop Stack Log file integration Machine Learning/Data Mining RDBMS Import/Export Query: HiveQL and Pig Latin Database MapReduce, HDFS SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 13
  • 14. SQL Server Live! Orlando 2012 What’s Hive? • Began as Hadoop sub-project – Now top-level Apache project • Provides a SQL-like (“HiveQL”) abstraction over MapReduce • Has its own HDFS table file format (and it’s fully schema-bound) • Can also work over HBase • Acts as a bridge to many BI products which expect tabular data Hadoop Distributions • Cloudera • Hortonworks – HCatalog: Hive/Pig/MR Interop • MapR – Network File System replaces HDFS • IBM InfoSphere BigInsights – HDFS<->DB2 integration • And now Microsoft… SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 14
  • 15. SQL Server Live! Orlando 2012 Microsoft HDInsight • Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows • Windows Azure HDInsight and Microsoft HDInsight (for Windows Server) – Single node preview runs on Windows client • Includes ODBC Driver for Hive – And Excel Add-In that uses it • JavaScript MapReduce framework • Contribute it all back to open source Apache Project Amenities for Visual Studio/.NET MRLib (NuGet Package) MR code in C#, HadoopJob, LINQ to Hive MapperBase, ReducerBase Hortonworks Data Platform for Windows OdbcClient + Debugging Hive ODBC Driver Deployment SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 15
  • 16. SQL Server Live! Orlando 2012 Some ways to work • Microsoft HDInsight – Cloud: go to www.hadooponazure.com, request invite – Local: Download Microsoft HDInsight Runs on just about anything, including Windows XP Get it via the Web Platform installer (WebPI) – Both are free for now; Azure HDInsight will be fee-based when RTM • Amazon Web Services Elastic MapReduce – Create AWS account – Select Elastic MapReduce in Dashboard – Cheap for experimenting, but not free • Cloudera CDH VM image – Download as .tar.gz file – “Un-tar” (can use WinRAR, 7zip) – Run via VMWare Player or Virtual Box – Everything’s free Some ways to work HDInsight EMR CDH 4 SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 16
  • 17. SQL Server Live! Orlando 2012 Microsoft HDInsight • Much simpler than the others • Browser-based portal – Launch MapReduce jobs – Azure: Provisioning cluster, managing ports, gather external data • Interactive JavaScript & Hive console – JS: HDFS, Pig, light data visualization – Hive commands and metadata discovery – New console coming • Desktop Shortcuts: – Command window, MapReduce, Name Node status in browser – Azure: from portal page you can RDP directly to Hadoop head node for these desktop shortcuts Windows Azure HDInsight SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 17
  • 18. SQL Server Live! Orlando 2012 Amazon Elastic MapReduce • Lots of steps! • At a high level: – Setup AWS account and S3 “buckets” – Generate Key Pair and PEM file – Install Ruby and EMR Command Line Interface – Provision the cluster using CLI A batch file can work very well here – Setup and run SSH/PuTTY – Work interactively at command line Amazon EMR – Prep Steps • Create an AWS account • Create an S3 bucket for log storage – with list permissions for authenticated users • Create a Key Pair and save PEM file • Install Ruby • Install Amazon Web Services Elastic MapReduce Command Line Interface – aka AWS EMR CLI  • Create credentials.json in EMR CLI folder – Associate with same region as where key pair created SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 18
  • 19. SQL Server Live! Orlando 2012 Amazon – Security and Startup • Security – Download PuTTYgen and run it – Click Load and browse to PEM file – Save it in PPK format – Exit PuTTYgen • In a command window, navigate to EMR CLI folder and enter command: – ruby elastic-mapreduce --create --alive [--num-instance xx] [--pig-interactive] [--hive-interactive] [--hbase --instance-type m1.large] • In AWS Console, go to EC2 Dashboard and click Instances on left nav bar • Wait until instance is running and get its Public DNS name – Use Compatibility View in IE or copy may not work Connect! • Download and run PuTTY • Paste DNS name of EC2 instance into hostname field • In Treeview, drill down and navigate to ConnectionSSHAuth, browse to PPK file • Once EC2 instance(s) running, click Open • Click Yes to “The server’s host key is not cached in the registry…” PuTTY Security Alert • When prompted for user name, type “hadoop” and hit Enter • cd bin, then hive, pig, hbase shell • Right-click to paste from clipboard; option to go full-screen • (Kill EC2 instance(s) from Dashboard when done) SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 19
  • 20. SQL Server Live! Orlando 2012 Amazon Elastic MapReduce Cloudera CDH4 Virtual Machine • Get it for free, in VMWare and Virtual Box versions. – VMWare player and Virtual Box are free too • Run it, and configure it to have its own IP on your network. Use ifconfig to discover IP. • Assuming IP of 192.168.1.59, open browser on your own (host) machine and navigate to: – http://192.168.1.59:8888 • Can also use browser in VM and hit: – http://localhost:8888 • Work in “Hue”… SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 20
  • 21. SQL Server Live! Orlando 2012 Hue • Browser based UI, with front ends for: – HDFS (w/ upload & download) – MapReduce job creation and monitoring – Hive (“Beeswax”) • And in-browser command line shells for: – HBase – Pig (“Grunt”) Impala: What it Is • Distributed SQL query engine over Hadoop cluster • Announced at Strata/Hadoop World in NYC on October 24th • In Beta, as part of CDH 4.1 • Works with HDFS and Hive data • Compatible with HiveQL and Hive drivers – Query with Beeswax SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 21
  • 22. SQL Server Live! Orlando 2012 Impala: What it’s Not • Impala is not Hive – Hive converts HiveQL to Java MapReduce code and executes it in batch mode – Impala executes query interactively over the data – Brings BI tools and Hadoop closer together • Impala is not an Apache Software Foundation project – Though it is open source and Apache-licensed, but it’s still incubated by Cloudera – Only in CDH Cloudera CDH4 SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 22
  • 23. SQL Server Live! Orlando 2012 Hadoop commands • HDFS – hadoop fs filecommand – Create and remove directories: mkdir, rm, rmr – Upload and download files to/from HDFS get, put – View directory contents ls, lsr – Copy, move, view files cp, mv, cat • MapReduce – Run a Java jar-file based job hadoop jar jarname params Hadoop (directly) SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 23
  • 24. SQL Server Live! Orlando 2012 HBase • Concepts: – Tables, column families – Columns, rows – Keys, values • Commands: – Definition: create, alter, drop, truncate – Manipulation: get, put, delete, deleteall, scan – Discovery: list, exists, describe, count – Enablement: disable, enable – Utilities: version, status, shutdown, exit – Reference: http://wiki.apache.org/hadoop/Hbase/Shell • Moreover, – Interesting HBase work can be done in MapReduce, Pig HBase Examples • create 't1', 'f1', 'f2', 'f3' • describe 't1' • alter 't1', {NAME => 'f1', VERSIONS => 5} • put 't1', 'r1', 'c1:f1', 'value' • get 't1', 'r1' • count 't1' SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 24
  • 25. SQL Server Live! Orlando 2012 HBase Submitting, Running and Monitoring Jobs • Upload a JAR • Use Streaming – Use other languages (i.e. other than Java) to write MapReduce code – Python is popular option – Any executable works, even C# console apps – On MS HDInsight, JavaScript works too – Still uses a JAR file: streaming.jar • Run at command line (passing JAR name and params) or use GUI SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 25
  • 26. SQL Server Live! Orlando 2012 Running MapReduce Jobs Hive • Used by most BI products which connect to Hadoop • Provides a SQL-like abstraction over Hadoop – Officially HiveQL, or HQL • Works on own tables, but also on HBase • Query generates MapReduce job, output of which becomes result set • Microsoft has Hive ODBC driver – Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only) SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 26
  • 27. SQL Server Live! Orlando 2012 Hive, Continued • Load data from flat HDFS files – LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable; • SQL Queries – CREATE, ALTER, DROP – INSERT OVERWRITE (creates whole tables) – SELECT, JOIN, WHERE, GROUP BY – SORT BY, but ordering data is tricky! – MAP/REDUCE/TRANSFORM…USING allows for custom map, reduce steps utilizing Java or streaming code Excel Add-In for Hive SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 27
  • 28. SQL Server Live! Orlando 2012 Hive Pig • Instead of SQL, employs a language (“Pig Latin”) that accommodates data flow expressions – Do a combo of Query and ETL • “10 lines of Pig Latin ≈ 200 lines of Java.” • Works with structured or unstructured data • Operations – As with Hive, a MapReduce job is generated – Unlike Hive, output is only flat file to HDFS or text at command line console – With MS Hadoop, can easily convert to JavaScript array, then manipulate • Use command line (“Grunt”) or build scripts SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 28
  • 29. SQL Server Live! Orlando 2012 Example • A = LOAD 'myfile' AS (x, y, z); B = FILTER A by x > 0; C = GROUP B BY x; D = FOREACH A GENERATE x, COUNT(B); STORE D INTO 'output'; Pig Latin Examples • Imperative, file system commands – LOAD, STORE Schema specified on LOAD • Declarative, query commands (SQL-like) – xxx = file or data set – FOREACH xxx GENERATE (SELECT…FROM xxx) – JOIN (WHERE/INNER JOIN) – FILTER xxx BY (WHERE) – ORDER xxx BY (ORDER BY) – GROUP xxx BY / GENERATE COUNT(xxx) (SELECT COUNT(*) GROUP BY) – DISTINCT (SELECT DISTINCT) • Syntax is assignment statement-based: – MyCusts = FILTER Custs BY SalesPerson eq 15; • Access Hbase – CpuMetrics = LOAD 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cp u:','-loadKey -returnTuple'); SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 29
  • 30. SQL Server Live! Orlando 2012 Pig Sqoop sqoop import --connect "jdbc:sqlserver://<servername>. database.windows.net:1433; database=<dbname>; user=<username>@<servername>; password=<password>" --table <from_table> --target-dir <to_hdfs_folder> --split-by <from_table_column> SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 30
  • 31. SQL Server Live! Orlando 2012 Sqoop sqoop export --connect "jdbc:sqlserver://<servername>. database.windows.net:1433; database=<dbname>; user=<username>@<servername>; password=<password>" --table <to_table> --export-dir <from_hdfs_folder> --input-fields-terminated-by "<delimiter>" Flume NG • Source – Avro (data serialization system – can read json- encoded data files, and can work over RPC) – Exec (reads from stdout of long-running process) • Sinks – HDFS, HBase, Avro • Channels – Memory, JDBC, file SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit © 2012 SQL Server Live! All rights reserved. 31
  • 32. Flume NG (next generation)
      • Set up conf/flume.conf
          # Define a memory channel called ch1 on agent1
          agent1.channels.ch1.type = memory
          # Define an Avro source called avro-source1 on agent1 and tell it
          # to bind to 0.0.0.0:41414. Connect it to channel ch1.
          agent1.sources.avro-source1.channels = ch1
          agent1.sources.avro-source1.type = avro
          agent1.sources.avro-source1.bind = 0.0.0.0
          agent1.sources.avro-source1.port = 41414
          # Define a logger sink that simply logs all events it receives
          # and connect it to the other end of the same channel.
          agent1.sinks.log-sink1.channel = ch1
          agent1.sinks.log-sink1.type = logger
          # Finally, now that we've defined all of our components, tell
          # agent1 which ones we want to activate.
          agent1.channels = ch1
          agent1.sources = avro-source1
          agent1.sinks = log-sink1
      • From the command line:
          flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
    Mahout Algorithms
      • Recommendation
        – Your info + community info
        – Give users/items/ratings; get user-user/item-item
        – itemsimilarity
      • Classification/Categorization
        – Drop into buckets
        – Naïve Bayes, Complementary Naïve Bayes, Decision Forests
      • Clustering
        – Like classification, but with categories unknown
        – K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift
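To make "like classification, but with categories unknown" concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) on one-dimensional data in plain Python. It is not Mahout's implementation, which runs as distributed jobs; the function name `kmeans_1d`, the sample points and the starting centroids are assumptions chosen for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data; returns (centroids, assignments)."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = sum(members) / len(members)
    return centroids, assignments

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, assignments = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, assignments)
```

No labels are supplied anywhere: the two buckets emerge from the data, which is the difference from the classification algorithms in the same list.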
  • 33. Workflow, Syntax
      • Workflow
        – Run the job
        – Dump the output
        – Visualize, predict
      • mahout algorithm --input folderspec --output folderspec --param1 value1 --param2 value2 …
      • Example:
        – mahout itemsimilarity --input <input-hdfs-path> --output <output-hdfs-path> --tempDir <tmp-hdfs-path> -s SIMILARITY_LOGLIKELIHOOD
    The Truth About Mahout
      • Mahout is really just an algorithm engine
      • Its output is almost unusable by non-statisticians/non-data scientists
      • You need a staff or a product to visualize it, or to turn it into a usable prediction model
      • Investigate Predixion Software
        – CTO, Jamie MacLennan, used to lead SQL Server Data Mining team
        – Excel add-in can use Mahout remotely, visualize its output, run predictive analyses
        – Also integrates with SQL Server, Greenplum, MapReduce
        – http://www.predixionsoftware.com
  • 34. The “Data-Refinery” Idea
      • Use Hadoop to “on-board” unstructured data, then extract manageable subsets
      • Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine them
      • This is the current rationalization of the coexistence of Hadoop and BI tools
      • Will it stay this way?
    DRILLDOWN ON NOSQL
  • 35. Hitting (Relational) Walls
      • CA – Highly-available consistency
      • CP – Enforced consistency
      • AP – Eventual consistency
    The reality…two pivots
      • Storage Methods
        – SQL (RDBMS)
        – NoSQL
      • Storage Locations
        – On premises
        – Cloud-hosted
  • 36. So many NoSQL options
      • More than just the Elephant in the room
      • More than 120 types of NoSQL databases
    Flavors of NoSQL
  • 37. Graph Database
      • Use for data with
        – a lot of many-to-many relationships
        – recursive self-joins
      • Use when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data
      • Examples: Neo4J, Freebase (Google)
    Column Database
      • Wide, sparse column sets
      • Schema-light
      • Examples:
        – Cassandra
        – HBase
        – BigTable
        – GAE HR DS (Google App Engine High Replication Datastore)
  • 38. More about Column Databases
      • Type A – Column-families
        – Non-relational
        – Sparse
        – Examples: HBase, Cassandra, xVelocity (SQL 2012 BISM)
      • Type B – Column-stores
        – Relational
        – Dense
        – Example: SQL Server 2012 Columnstore index
    Demo - Document Database (MongoDB)
      • Use for data that is
        – document-oriented (collection of JSON documents) with semi-structured data
        – encoded as XML, YAML, JSON or BSON
        – in binary forms (PDF, Microsoft Office documents such as Word, Excel…)
      • Examples: MongoDB, CouchDB
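A document collection with semi-structured data can be pictured as a list of JSON-like records whose fields vary from one document to the next. The sketch below uses plain Python dictionaries rather than a MongoDB driver; the sample documents and the helper `find` are hypothetical and only illustrate querying on a nested field that some documents lack.

```python
import json

# Hypothetical documents: no two share exactly the same set of fields.
collection = [
    {"_id": 1, "name": "Ann", "tags": ["admin"], "address": {"city": "Orlando"}},
    {"_id": 2, "name": "Bob", "email": "bob@example.com"},
    {"_id": 3, "name": "Cy", "address": {"city": "Orlando"}, "tags": []},
]

def find(docs, path, value):
    """Return documents whose dotted `path` resolves to `value`."""
    matches = []
    for doc in docs:
        node = doc
        for key in path.split("."):
            # Walk one level down; a missing field simply yields no match.
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if node == value:
            matches.append(doc)
    return matches

orlando = find(collection, "address.city", "Orlando")
print(json.dumps([d["_id"] for d in orlando]))
```

The document without an address is skipped rather than causing an error, which is the practical meaning of schema-light storage: absent fields are normal, not a constraint violation.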
  • 39. Demo MongoDB
    Persistent Key / Value Database
      • Schema-less
      • State - Persistent
      • Examples
        – AWS DynamoDB
        – Azure Tables
        – Project Voldemort
  • 40. Volatile Key / Value Database
      • Schema-less
      • State - Volatile
      • Examples
        – Redis
        – Memcached
    Which type of NoSQL for which type of data?
      Type of Data               Type of NoSQL         Example solution
      Log files                  Wide Column           HBase
      Product Catalogs           Key Value on disk     DynamoDB
      User profiles              Key Value in memory   Redis
      Startups                   Document              MongoDB
      Social media connections   Graph                 Neo4j
      LOB w/Transactions         NONE! Use RDBMS       SQL Server
  • 41. What about the cloud?
    Cloud-hosted NoSQL up to 50x CHEAPER
  • 42. Consumer Storage Buckets
      • Dropbox
      • Box
      • Windows SkyDrive
      • Google Drive
      • Amazon Cloud Drive
      • Apple iCloud
    Developer BLOB Storage Buckets
      • Amazon – S3 or Glacier
      • Google – Cloud Storage
      • Microsoft Azure BLOBS
      • Others
  • 43. Cloud-hosted RDBMS
      • AWS RDS – SQL Server, MySQL, Oracle
        – Medium cost
        – Solid feature set, e.g. backup, snapshot
        – Use existing tooling
      • Google – MySQL
        – Lowest cost
        – Most limited RDBMS functionality
      • Microsoft – Windows Azure SQL Database
        – Highest cost
        – Azure VMs w/MySQL
  • 44. Other cloud data services
      • Hosting public datasets
        – Pay to read
        – Earn revenue by offering for read
      • Cleaning / matching (your) data
        – ETL – Microsoft Data Explorer, Google Refine
        – Data Quality – Windows Azure Marketplace, InfoChimps, DataMarket.com
    Cloud – RDBMS, NoSQL & Hadoop (AWS / Google / Microsoft)
      • Cloud RDBMS: AWS – SQL Server, Oracle; Google – MySQL; Microsoft – SQL Azure / MySQL
      • NoSQL buckets: AWS – S3 or Glacier; Google – Cloud Storage; Microsoft – Azure Storage
      • NoSQL databases: AWS – DynamoDB; Google – H/R Datastore on GAE; Microsoft – Azure Tables
      • Streaming & Machine Learning: AWS – Custom EC2; Google – Prospective Search & Prediction API; Microsoft – StreamInsight & Mahout with Hadoop
      • Document or Graph: AWS – MongoDB on EC2; Google – Freebase (g); Microsoft – MongoDB on Windows Azure
      • Hadoop: AWS – Elastic MapReduce using S3 & EC2; Google – MapR & GCE; Microsoft – Windows Azure HDInsight
      • Data sets & other: AWS – Karmasphere; Google – Translation API; Microsoft – Azure DataMarket; Full-text search
  • 45. Demo Amazon RDS
    Pick your mix and then…
      • Other
        – Use Cloud Data Markets
        – Use Cloud ETL Services
      • RDBMS
        – Host locally
        – Host in the Cloud
      • NoSQL
        – Host locally
        – Host in the Cloud
  • 46. What about me?
    Common DBA Tasks in NoSQL
      RDBMS                            NoSQL
      Import Data                      Import Data
      Setup Security                   Setup Security
      Perform a Backup                 Make a copy of the data
      Restore a Database               Move a copy to a location
      Create an Index                  Create an Index
      Join Tables Together             Run MapReduce
      Schedule a Job                   Schedule a (Cron) Job
      Run Database Maintenance         Monitor space and resources used
      Send an Email from SQL Server    Set up resource threshold alerts
      Search BOL                       Interpret Documentation
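The pairing of "Join Tables Together" with "Run MapReduce" can be illustrated with a reduce-side join: both data sets are mapped to the join key, grouped by that key, and combined in the reducer. The following single-process Python sketch shows the pattern only; the sample data and the function names `map_phase`, `shuffle` and `reduce_phase` are invented for illustration, and a real job would run on a cluster.

```python
from collections import defaultdict

customers = [(1, "Ann"), (2, "Bob")]               # (customer_id, name)
orders = [(100, 1, 25.0), (101, 2, 40.0), (102, 1, 15.0)]
# orders rows are (order_id, customer_id, amount)

def map_phase():
    """Emit (join_key, tagged_value) pairs from both inputs."""
    for cust_id, name in customers:
        yield cust_id, ("C", name)
    for order_id, cust_id, amount in orders:
        yield cust_id, ("O", (order_id, amount))

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """For each key, pair every customer value with every order value."""
    for cust_id in sorted(grouped):
        values = grouped[cust_id]
        names = [v for tag, v in values if tag == "C"]
        order_rows = [v for tag, v in values if tag == "O"]
        for name in names:
            for order_id, amount in order_rows:
                yield (cust_id, name, order_id, amount)

joined = list(reduce_phase(shuffle(map_phase())))
print(joined)
```

The tags "C" and "O" are what let the reducer tell the two inputs apart once they share a key, which is the work a JOIN clause does implicitly in an RDBMS.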
  • 47. Making Sense – Asking Questions
    Data Scientists…
  • 48. Comparing…
    Karmasphere Studio for AWS
  • 49. Google BigQuery w/Excel
      • Dremel-based service
        – For massive amounts of data
        – BigQuery currently has quota limits
        – SQL-like query language
    Demo Google BigQuery
  • 50. NoSQL To-Do List
      • Understand CAP & types of NoSQL databases
        – Use NoSQL when business needs call for it
        – Use the right type of NoSQL for your business problem
      • Try out NoSQL on the cloud
        – Quick and cheap for behavioral data
        – Mashup cloud datasets
        – Good for specialized use cases, e.g. dev, test, training environments
      • Learn NoSQL access technologies
        – New query languages, e.g. MapReduce, R, Infer.NET
        – New query tools (vendor-specific): Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc.
    The Changing Data Landscape
      • Other Services
      • RDBMS
      • NoSQL
  • 51. NoSQL for .NET Developers
      • RavenDB
      • MongoDB C#/.NET Driver
      • MongoDB on Windows Azure
      • CouchBase .NET Client Library
      • Riak client for .NET
      • AWS Toolkit for Visual Studio
      • Google cloud APIs (REST-based)