Big Data and NoSQL in Microsoft-Land
Andrew Brust
SQL Server Live! Orlando 2012
1.
SQL Server Live! Orlando 2012
Big Data and NoSQL in Microsoft-Land
Andrew Brust and Lynn Langit
Blue Badge Insights & Data Wrangler
Level: Intermediate

Meet Andrew
• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 17 years as a speaker
• Founder, Microsoft BI User Group of NYC
  – http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group
  – http://www.nycdotnetdev.com
• “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
• brustblog.com, Twitter: @andrewbrust

SQF2 - Workshop: Big Data and NoSQL in Microsoft-Land - Andrew Brust and Lynn Langit
© 2012 SQL Server Live! All rights reserved.
2.
Andrew’s New Blog (bit.ly/bigondata)

Meet Lynn
• CEO and Founder, Lynn Langit consulting
• Former Microsoft Evangelist (4 years)
• Google Developer Expert
• MongoDB Master
• MCT 13 years – 7 certifications
• Cloudera Certified Developer
• MSDN Magazine articles
  – SQL Azure
  – Hadoop on Azure
  – MongoDB on Azure
• www.LynnLangit.com
• @LynnLangit
3.
Lynn’s YouTube Channel

www.TeachingKidsProgramming.org
• Free Courseware (recipes)
• Do a Recipe, Teach a Kid (Ages 10++)
• Java or Microsoft SmallBasic
4.
Read all about it!

Agenda
• Overview / Landscape
  – Big Data, and Hadoop
  – NoSQL
  – The Big Data-NoSQL Intersection
• Drilldown on Big Data
• Drilldown on NoSQL
5.
What is Big Data?
• 100s of TB into PB and higher
• Involving data from: financial data, sensors, web logs, social media, etc.
• Parallel processing often involved
  – Hadoop is emblematic, but other technologies are Big Data too
• Processing of data sets too large for transactional databases
  – Analyzing interactions, rather than transactions
  – The three V’s: Volume, Velocity, Variety
• Big Data tech sometimes imposed on small data problems

BigData = Exponentially More Data
• Retail Example -> ‘Feedback Economy’
  – Number of transactions
  – Number of behaviors (collected every minute)
6.
BigData = ‘Next State’ Questions
(Collecting behavioral data)
• What could happen?
• Why didn’t this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What happens?

What’s MapReduce?
• “Big” input data as key-value pair series
• Partition the data and send to mappers (nodes in cluster)
• Mappers pre-process, put into key-value format, and send all output for a given (set of) key(s) to a reducer
• Reducer aggregates; one output per key, with value
• Map and Reduce code natively written as Java functions
7.
MapReduce, in a Diagram
[Diagram: each input goes to a mapper; mapper outputs are grouped by key (K1, K2, K3) and routed as inputs to reducers, each of which emits one output]

A MapReduce Example
• Count by suite, on each floor
• Send per-suite, per platform totals to lobby
• Sort totals by platform
• Send two platform packets to 10th, 20th, 30th floor
• Tally up each platform
• Collect the tallies
• Merge tallies into one spreadsheet
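The map, shuffle, and reduce steps described above can be sketched in a few lines of plain Python. This is only an in-memory illustration, not Hadoop code: the survey records, the function names, and the choice to key on platform are all assumptions made for the example.

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: turn each raw record into a (key, value) pair."""
    for floor, suite, platform in records:
        yield platform, 1

def shuffle(pairs):
    """Group all values that share a key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: emit one output per key, here a simple sum."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical survey data: (floor, suite, platform in use)
records = [
    (10, "1001", "Windows"),
    (10, "1002", "Mac"),
    (20, "2001", "Windows"),
    (30, "3001", "Linux"),
    (30, "3002", "Windows"),
]

tallies = reduce_phase(shuffle(map_phase(records)))
print(tallies)  # {'Windows': 3, 'Mac': 1, 'Linux': 1}
```

On a real cluster the mappers would run in parallel on separate nodes and the shuffle would move data over the network; here everything runs in one process so the data flow is easy to follow.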
8.
What’s a Distributed File System?
• One where data gets distributed over commodity drives on commodity servers
• Data is replicated
• If one box goes down, no data lost
  – “Shared Nothing”
• BUT: Immutable
  – Files can only be written to once
  – So updates require drop + re-write (slow)
  – You can append though
  – Like a DVD/CD-ROM

Hadoop = MapReduce + HDFS
• Modeled after Google MapReduce + GFS
• Have more data? Just add more nodes to cluster.
  – Mappers execute in parallel
  – Hardware is commodity
  – “Scaling out”
• Use of HDFS means data may well be local to mapper processing
• So, not just parallel, but minimal data movement, which avoids network bottlenecks
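To make the "data is replicated, so one box going down loses nothing" point concrete, here is a toy placement sketch. It is not how HDFS actually chooses nodes (real placement is rack-aware); the round-robin policy, node names, and block IDs are assumptions for illustration only.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for index, block in enumerate(block_ids):
        start = index % len(nodes)
        placement[block] = [nodes[(start + i) % len(nodes)] for i in range(replication)]
    return placement

def readable_blocks(placement, failed_node):
    """Blocks that still have at least one replica on a surviving node."""
    return {
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    }

nodes = ["node1", "node2", "node3", "node4"]    # hypothetical cluster
blocks = [f"block{i}" for i in range(6)]         # hypothetical file blocks
placement = place_blocks(blocks, nodes)

# Losing any single node leaves every block readable from another replica.
for node in nodes:
    assert readable_blocks(placement, node) == set(blocks)
```

The same structure also hints at data locality: a scheduler that knows `placement` can send a mapper to one of the nodes that already holds the block it needs.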
9.
Example Comparison: RDBMS vs. Hadoop

                    | Traditional RDBMS       | Hadoop / MapReduce
Data Size           | Gigabytes (Terabytes)   | Petabytes (Exabytes)
Access              | Interactive and Batch   | Batch – NOT Interactive
Updates             | Read / Write many times | Write once, Read many times
Structure           | Static Schema           | Dynamic Schema
Integrity           | High (ACID)             | Low
Scaling             | Nonlinear               | Linear
Query Response Time | Can be near immediate   | Has latency (due to batch processing)

Just-in-time Schema
• When looking at unstructured data, schema is imposed at query time
• Schema is context specific
  – If scanning a book, are the values words, lines, or pages?
  – Are notes a single field, or is each word a value?
  – Are date and time two fields or one?
  – Are street, city, state, zip separate or one value?
  – Pig and Hive let you determine this at query time
  – So does the Map function in MapReduce code
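A small sketch of "schema imposed at query time": the same raw lines are read under two different schemas, chosen by the reader rather than fixed at load time. The pipe-delimited record layout and the field names are invented for this example.

```python
# Hypothetical raw records, stored as-is with no schema attached.
raw_lines = [
    "2012-12-10 09:15:00|221B Baker Street, London|first visit",
    "2012-12-11 17:40:12|10 Downing Street, London|follow-up call",
]

def coarse_schema(line):
    """Schema 1: timestamp, address, and notes are each a single field."""
    timestamp, address, notes = line.split("|")
    return {"timestamp": timestamp, "address": address, "notes": notes}

def fine_schema(line):
    """Schema 2: split date/time and street/city, and treat each note word as a value."""
    timestamp, address, notes = line.split("|")
    date, time = timestamp.split(" ")
    street, city = [part.strip() for part in address.split(",")]
    return {
        "date": date,
        "time": time,
        "street": street,
        "city": city,
        "words": notes.split(),
    }

# The stored data never changes; only the interpretation does.
print(coarse_schema(raw_lines[0])["notes"])   # first visit
print(fine_schema(raw_lines[0])["words"])     # ['first', 'visit']
```

In Pig or Hive the equivalent decision is made in the LOAD clause or table definition of each query; in hand-written MapReduce it lives in the map function, much like the two parsing functions here.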
10.
What’s HBase?
• A Wide-Column Store NoSQL database
• Modeled after Google BigTable
• Uses HDFS
  – Therefore, Hadoop-compatible
• Hadoop often used with HBase
  – But you can use either without the other
11.
NoSQL Confusion
• Many ‘flavors’ of NoSQL data stores
• Easiest to group by functionality, but…
  – Dividing lines are not clear or consistent
• NoSQL choice(s) driven by many factors
  – Type of data
  – Quantity of data
  – Knowledge of technical staff
  – Product maturity
  – Tooling

So much wrong information
• Everything is ‘new’
• People are religious about data storage
• Lots of incorrect information
• ‘Try’ before you ‘buy’ (or use)
• Watch out for oversimplification
• Confusion over vendor offerings
12.
Common NoSQL Misconceptions
Problems
• Everything is ‘new’
• People are religious about data storage
• Open source is always cheaper
• Cloud is always cheaper
• Replace RDBMS with NoSQL
Solutions
• ‘Try’ before you ‘buy’ (or use)
• Leverage NoSQL communities
• Add NoSQL to existing RDBMS solution

NoSQL + Big Data
• HBase and Cassandra work with Hadoop, are NoSQL databases
• MongoDB brands itself a Big Data technology
• Couchbase does too
• Just-in-time schema
• MapReduce in MongoDB, others
• Hadoop and most NoSQL DBs are partitioned, scale-out technologies
• It’s all about analytics on semi- or un-structured data
13.
DRILLDOWN ON BIG DATA

The Hadoop Stack
• Log file integration
• Machine Learning/Data Mining
• RDBMS Import/Export
• Query: HiveQL and Pig Latin
• Database
• MapReduce, HDFS
14.
What’s Hive?
• Began as Hadoop sub-project
  – Now top-level Apache project
• Provides a SQL-like (“HiveQL”) abstraction over MapReduce
• Has its own HDFS table file format (and it’s fully schema-bound)
• Can also work over HBase
• Acts as a bridge to many BI products which expect tabular data

Hadoop Distributions
• Cloudera
• Hortonworks
  – HCatalog: Hive/Pig/MR Interop
• MapR
  – Network File System replaces HDFS
• IBM InfoSphere BigInsights
  – HDFS<->DB2 integration
• And now Microsoft…
15.
Microsoft HDInsight
• Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows
• Windows Azure HDInsight and Microsoft HDInsight (for Windows Server)
  – Single node preview runs on Windows client
• Includes ODBC Driver for Hive
  – And Excel Add-In that uses it
• JavaScript MapReduce framework
• Contribute it all back to open source Apache Project

Amenities for Visual Studio/.NET
• MRLib (NuGet Package): MR code in C#, HadoopJob, MapperBase, ReducerBase
• LINQ to Hive: OdbcClient + Hive ODBC Driver
• Hortonworks Data Platform for Windows
• Debugging, Deployment
16.
Some ways to work
• Microsoft HDInsight
  – Cloud: go to www.hadooponazure.com, request invite
  – Local: Download Microsoft HDInsight
    Runs on just about anything, including Windows XP
    Get it via the Web Platform installer (WebPI)
  – Both are free for now; Azure HDInsight will be fee-based when RTM
• Amazon Web Services Elastic MapReduce
  – Create AWS account
  – Select Elastic MapReduce in Dashboard
  – Cheap for experimenting, but not free
• Cloudera CDH VM image
  – Download as .tar.gz file
  – “Un-tar” (can use WinRAR, 7zip)
  – Run via VMWare Player or Virtual Box
  – Everything’s free

Some ways to work
• HDInsight
• EMR
• CDH 4
17.
Microsoft HDInsight
• Much simpler than the others
• Browser-based portal
  – Launch MapReduce jobs
  – Azure: Provisioning cluster, managing ports, gather external data
• Interactive JavaScript & Hive console
  – JS: HDFS, Pig, light data visualization
  – Hive commands and metadata discovery
  – New console coming
• Desktop Shortcuts:
  – Command window, MapReduce, Name Node status in browser
  – Azure: from portal page you can RDP directly to Hadoop head node for these desktop shortcuts

Windows Azure HDInsight
18.
Amazon Elastic MapReduce
• Lots of steps!
• At a high level:
  – Set up AWS account and S3 “buckets”
  – Generate Key Pair and PEM file
  – Install Ruby and EMR Command Line Interface
  – Provision the cluster using CLI
    A batch file can work very well here
  – Set up and run SSH/PuTTY
  – Work interactively at command line

Amazon EMR – Prep Steps
• Create an AWS account
• Create an S3 bucket for log storage
  – with list permissions for authenticated users
• Create a Key Pair and save PEM file
• Install Ruby
• Install Amazon Web Services Elastic MapReduce Command Line Interface
  – aka AWS EMR CLI
• Create credentials.json in EMR CLI folder
  – Associate with same region as where key pair created
19.
Amazon – Security and Startup
• Security
  – Download PuTTYgen and run it
  – Click Load and browse to PEM file
  – Save it in PPK format
  – Exit PuTTYgen
• In a command window, navigate to EMR CLI folder and enter command:
  – ruby elastic-mapreduce --create --alive [--num-instances xx] [--pig-interactive] [--hive-interactive] [--hbase --instance-type m1.large]
• In AWS Console, go to EC2 Dashboard and click Instances on left nav bar
• Wait until instance is running and get its Public DNS name
  – Use Compatibility View in IE or copy may not work

Connect!
• Download and run PuTTY
• Paste DNS name of EC2 instance into hostname field
• In Treeview, drill down and navigate to Connection > SSH > Auth, browse to PPK file
• Once EC2 instance(s) running, click Open
• Click Yes to “The server’s host key is not cached in the registry…” PuTTY Security Alert
• When prompted for user name, type “hadoop” and hit Enter
• cd bin, then hive, pig, hbase shell
• Right-click to paste from clipboard; option to go full-screen
• (Kill EC2 instance(s) from Dashboard when done)
20.
Amazon Elastic MapReduce

Cloudera CDH4 Virtual Machine
• Get it for free, in VMWare and Virtual Box versions.
  – VMWare player and Virtual Box are free too
• Run it, and configure it to have its own IP on your network. Use ifconfig to discover IP.
• Assuming IP of 192.168.1.59, open browser on your own (host) machine and navigate to:
  – http://192.168.1.59:8888
• Can also use browser in VM and hit:
  – http://localhost:8888
• Work in “Hue”…
21.
Hue
• Browser based UI, with front ends for:
  – HDFS (w/ upload & download)
  – MapReduce job creation and monitoring
  – Hive (“Beeswax”)
• And in-browser command line shells for:
  – HBase
  – Pig (“Grunt”)

Impala: What it Is
• Distributed SQL query engine over Hadoop cluster
• Announced at Strata/Hadoop World in NYC on October 24th
• In Beta, as part of CDH 4.1
• Works with HDFS and Hive data
• Compatible with HiveQL and Hive drivers
  – Query with Beeswax
22.
Impala: What it’s Not
• Impala is not Hive
  – Hive converts HiveQL to Java MapReduce code and executes it in batch mode
  – Impala executes query interactively over the data
  – Brings BI tools and Hadoop closer together
• Impala is not an Apache Software Foundation project
  – Although it is open source and Apache-licensed, it is still incubated by Cloudera
  – Only in CDH

Cloudera CDH4
23.
Hadoop commands
• HDFS
  – hadoop fs filecommand
  – Create and remove directories: mkdir, rm, rmr
  – Upload and download files to/from HDFS: get, put
  – View directory contents: ls, lsr
  – Copy, move, view files: cp, mv, cat
• MapReduce
  – Run a Java jar-file based job: hadoop jar jarname params

Hadoop (directly)
24.
HBase
• Concepts:
  – Tables, column families
  – Columns, rows
  – Keys, values
• Commands:
  – Definition: create, alter, drop, truncate
  – Manipulation: get, put, delete, deleteall, scan
  – Discovery: list, exists, describe, count
  – Enablement: disable, enable
  – Utilities: version, status, shutdown, exit
  – Reference: http://wiki.apache.org/hadoop/Hbase/Shell
• Moreover,
  – Interesting HBase work can be done in MapReduce, Pig

HBase Examples
• create 't1', 'f1', 'f2', 'f3'
• describe 't1'
• alter 't1', {NAME => 'f1', VERSIONS => 5}
• put 't1', 'r1', 'f1:c1', 'value'
• get 't1', 'r1'
• count 't1'
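The concepts listed above (tables, column families, columns, rows, versions) can be modeled with nested dictionaries. The sketch below is a teaching toy, not an HBase client: the `TinyTable` class and its methods are hypothetical names that loosely mirror the shell's create, put, get, and count commands, and it relies on the HBase convention that a column is addressed as family:qualifier.

```python
from collections import defaultdict

class TinyTable:
    """In-memory stand-in for a wide-column table: row key -> column -> versions."""

    def __init__(self, families, max_versions=1):
        self.families = set(families)
        self.max_versions = max_versions
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value):
        """Store a value under 'family:qualifier'; the family must already exist."""
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        versions = self.rows[row_key][column]
        versions.insert(0, value)            # newest version first
        del versions[self.max_versions:]     # keep at most max_versions

    def get(self, row_key):
        """Return the newest value of every column in the row."""
        return {col: vals[0] for col, vals in self.rows.get(row_key, {}).items()}

    def count(self):
        """Number of rows that hold at least one cell."""
        return len(self.rows)

t1 = TinyTable(["f1", "f2", "f3"], max_versions=5)
t1.put("r1", "f1:c1", "value")
t1.put("r1", "f1:c1", "newer value")
t1.put("r1", "f2:anything", 42)   # columns are created on write, per row
print(t1.get("r1"))               # {'f1:c1': 'newer value', 'f2:anything': 42}
print(t1.count())                 # 1
```

Note that only the column families are declared up front; individual columns appear when a row first writes to them, which is what makes the store "wide-column".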
25.
HBase

Submitting, Running and Monitoring Jobs
• Upload a JAR
• Use Streaming
  – Use other languages (i.e., other than Java) to write MapReduce code
  – Python is popular option
  – Any executable works, even C# console apps
  – On MS HDInsight, JavaScript works too
  – Still uses a JAR file: streaming.jar
• Run at command line (passing JAR name and params) or use GUI
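Since Python is called out as a popular streaming language, here is a minimal word-count sketch in the streaming style: the mapper emits tab-separated key/value lines and the reducer sums consecutive lines that share a key. In a real streaming job each function would live in its own script reading standard input; the function names and the local dry run (where `sorted` stands in for Hadoop's shuffle) are assumptions for illustration.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word, as a streaming mapper would print to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum counts for each run of identical keys; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local dry run: sorting the mapper output plays the role of the shuffle/sort phase.
sample = ["big data big", "data wrangler"]
results = list(reducer(sorted(mapper(sample))))
for line in results:
    print(line)
```

Because the contract is just lines on standard input and output, the same mapper and reducer logic could be written as any executable, which is the point the slide makes about C# console apps.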
26.
Running MapReduce Jobs

Hive
• Used by most BI products which connect to Hadoop
• Provides a SQL-like abstraction over Hadoop
  – Officially HiveQL, or HQL
• Works on own tables, but also on HBase
• Query generates MapReduce job, output of which becomes result set
• Microsoft has Hive ODBC driver
  – Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
27.
Hive, Continued
• Load data from flat HDFS files
  – LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable;
• SQL Queries
  – CREATE, ALTER, DROP
  – INSERT OVERWRITE (creates whole tables)
  – SELECT, JOIN, WHERE, GROUP BY
  – SORT BY, but ordering data is tricky!
  – MAP/REDUCE/TRANSFORM…USING allows for custom map, reduce steps utilizing Java or streaming code

Excel Add-In for Hive
28.
Hive

Pig
• Instead of SQL, employs a language (“Pig Latin”) that accommodates data flow expressions
  – Do a combo of Query and ETL
• “10 lines of Pig Latin ≈ 200 lines of Java.”
• Works with structured or unstructured data
• Operations
  – As with Hive, a MapReduce job is generated
  – Unlike Hive, output is only flat file to HDFS or text at command line console
  – With MS Hadoop, can easily convert to JavaScript array, then manipulate
• Use command line (“Grunt”) or build scripts
29.
Example
• A = LOAD 'myfile' AS (x, y, z);
  B = FILTER A BY x > 0;
  C = GROUP B BY x;
  D = FOREACH C GENERATE group, COUNT(B);
  STORE D INTO 'output';

Pig Latin Examples
• Imperative, file system commands
  – LOAD, STORE
    Schema specified on LOAD
• Declarative, query commands (SQL-like)
  – xxx = file or data set
  – FOREACH xxx GENERATE (SELECT…FROM xxx)
  – JOIN (WHERE/INNER JOIN)
  – FILTER xxx BY (WHERE)
  – ORDER xxx BY (ORDER BY)
  – GROUP xxx BY / GENERATE COUNT(xxx) (SELECT COUNT(*) GROUP BY)
  – DISTINCT (SELECT DISTINCT)
• Syntax is assignment statement-based:
  – MyCusts = FILTER Custs BY SalesPerson == 15;
• Access HBase
  – CpuMetrics = LOAD 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey -returnTuple');
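For readers more familiar with general-purpose languages, the load, filter, group, and count data flow in the Pig example can be mirrored step by step in Python. The sample rows are invented, and the comments use the usual Pig idiom for the count step (FOREACH over the grouped relation, generating `group` and `COUNT`); this is an analogy for the data flow, not something Pig itself runs.

```python
from collections import Counter

# A = LOAD 'myfile' AS (x, y, z);   (hypothetical rows standing in for the file)
a = [(1, "a", 10), (1, "b", 20), (-3, "c", 30), (2, "d", 40), (0, "e", 50)]

# B = FILTER A BY x > 0;
b = [row for row in a if row[0] > 0]

# C = GROUP B BY x;  D = FOREACH C GENERATE group, COUNT(B);
d = Counter(x for x, _y, _z in b)

# STORE D INTO 'output';   (here: just a sorted list of (group, count) pairs)
output = sorted(d.items())
print(output)  # [(1, 2), (2, 1)]
```

Each Pig statement names an intermediate relation, just as each Python line names an intermediate collection; the difference is that Pig compiles the whole chain into MapReduce jobs instead of running it in memory.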
30.
Pig

Sqoop
sqoop import \
  --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>" \
  --table <from_table> \
  --target-dir <to_hdfs_folder> \
  --split-by <from_table_column>
31.
Sqoop
sqoop export \
  --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>" \
  --table <to_table> \
  --export-dir <from_hdfs_folder> \
  --input-fields-terminated-by "<delimiter>"

Flume NG
• Source
  – Avro (data serialization system – can read json-encoded data files, and can work over RPC)
  – Exec (reads from stdout of long-running process)
• Sinks
  – HDFS, HBase, Avro
• Channels
  – Memory, JDBC, file
32.
Flume NG (next generation)
• Set up conf/flume.conf
  # Define a memory channel called ch1 on agent1
  agent1.channels.ch1.type = memory

  # Define an Avro source called avro-source1 on agent1 and tell it
  # to bind to 0.0.0.0:41414. Connect it to channel ch1.
  agent1.sources.avro-source1.channels = ch1
  agent1.sources.avro-source1.type = avro
  agent1.sources.avro-source1.bind = 0.0.0.0
  agent1.sources.avro-source1.port = 41414

  # Define a logger sink that simply logs all events it receives
  # and connect it to the other end of the same channel.
  agent1.sinks.log-sink1.channel = ch1
  agent1.sinks.log-sink1.type = logger

  # Finally, now that we've defined all of our components, tell
  # agent1 which ones we want to activate.
  agent1.channels = ch1
  agent1.sources = avro-source1
  agent1.sinks = log-sink1
• From the command line:
  flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1

Mahout Algorithms
• Recommendation
  – Your info + community info
  – Give users/items/ratings; get user-user/item-item
  – itemsimilarity
• Classification/Categorization
  – Drop into buckets
  – Naïve Bayes, Complementary Naïve Bayes, Decision Forests
• Clustering
  – Like classification, but with categories unknown
  – K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift
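The flume.conf shown above wires a source and a sink to a channel purely by name, so a typo in a channel name is an easy mistake. As a rough illustration, this Python sketch parses that style of properties text and reports sources or sinks that point at a channel the agent does not declare. It is not part of Flume; the function names are hypothetical and the parser assumes simple `key = value` lines only.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def unknown_channel_refs(props, agent):
    """Return (key, channel) pairs that reference an undeclared channel."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for key, value in props.items():
        parts = key.split(".")
        if len(parts) != 4 or parts[0] != agent:
            continue
        if parts[1] == "sources" and parts[3] == "channels":
            refs = value.split()   # a source may list several channels
        elif parts[1] == "sinks" and parts[3] == "channel":
            refs = [value]         # a sink reads from exactly one channel
        else:
            continue
        problems += [(key, ref) for ref in refs if ref not in declared]
    return problems


# Trimmed copy of the configuration from the slide
CONF = """
agent1.channels.ch1.type = memory
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1
"""

print(unknown_channel_refs(parse_properties(CONF), "agent1"))
```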
33.
Workflow, Syntax
• Workflow
  – Run the job
  – Dump the output
  – Visualize, predict
• mahout algorithm --input folderspec --output folderspec --param1 value1 --param2 value2 …
• Example:
  – mahout itemsimilarity --input <input-hdfs-path> --output <output-hdfs-path> --tempDir <tmp-hdfs-path> -s SIMILARITY_LOGLIKELIHOOD

The Truth About Mahout
• Mahout is really just an algorithm engine
• Its output is almost unusable by non-statisticians/non-data scientists
• You need a staff or a product to visualize it, or to make it into a usable prediction model
• Investigate Predixion Software
  – CTO, Jamie MacLennan, used to lead SQL Server Data Mining team
  – Excel add-in can use Mahout remotely, visualize its output, run predictive analyses
  – Also integrates with SQL Server, Greenplum, MapReduce
  – http://www.predixionsoftware.com
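The SIMILARITY_LOGLIKELIHOOD option in the itemsimilarity example refers to a log-likelihood ratio computed over co-occurrence counts. The Python sketch below shows the textbook form of that statistic (the G-test on a 2x2 table) for a pair of items. Whether Mahout computes it in exactly this way is an assumption, and the function name and count layout are hypothetical.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G-test statistic for a 2x2 table of co-occurrence counts.

    k11: users who interacted with both items
    k12: users who interacted with item A only
    k21: users who interacted with item B only
    k22: users who interacted with neither
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g = 0.0
    for count, r, c in cells:
        if count > 0:  # empty cells contribute nothing to the sum
            expected = rows[r] * cols[c] / total
            g += count * math.log(count / expected)
    return 2.0 * g


# Items that always co-occur score high; independent items score zero.
print(log_likelihood_ratio(10, 0, 0, 10))
print(log_likelihood_ratio(10, 10, 10, 10))
```

A higher score means the co-occurrence is less likely to be chance, which is what makes it usable as an item-item similarity signal.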
34.
The “Data-Refinery” Idea
• Use Hadoop to “on-board” unstructured data, then extract manageable subsets
• Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine them
• This is the current rationalization of Hadoop + BI tools’ coexistence
• Will it stay this way?

DRILLDOWN ON NOSQL
35.
Hitting (Relational) Walls
• CA – Highly-available consistency
• CP – Enforced consistency
• AP – Eventual consistency

The reality…two pivots
Storage Methods
• SQL (RDBMS)
• NoSQL
Storage Locations
• On premises
• Cloud-hosted
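The AP case (eventual consistency) is easier to picture with a toy model. In this Python sketch two replicas accept writes independently and later exchange state, resolving conflicts by keeping the newest timestamp. Last-write-wins is only one possible reconciliation strategy, chosen here as an assumption for brevity; the class and its methods are hypothetical and do not model any particular product.

```python
class Replica:
    """A toy replica that stores key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        # Keep the incoming value only if it is newer than what we hold
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        # Anti-entropy pass: pull every entry the other replica holds
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)


a, b = Replica(), Replica()
a.write("cart", "1 item", timestamp=1)
b.write("cart", "2 items", timestamp=2)
before = (a.read("cart"), b.read("cart"))   # replicas disagree for a while

a.sync_from(b)
b.sync_from(a)
after = (a.read("cart"), b.read("cart"))    # replicas converge
print(before, after)
```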
36.
So many NoSQL options
• More than just the Elephant in the room
• More than 120 types of NoSQL databases

Flavors of NoSQL
37.
Graph Database
• Use for data with
  – a lot of many-to-many relationships
  – recursive self-joins
• Use when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data
• Examples: Neo4j, Freebase (Google)

Column Database
• Wide, sparse column sets
• Schema-light
• Examples:
  – Cassandra
  – HBase
  – BigTable
  – GAE HR DS (Google App Engine High Replication Datastore)
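The "finding connections" use case for graph databases amounts to traversing relationships rather than joining tables repeatedly. As a small illustration, this Python sketch runs a breadth-first search over an adjacency list to find everyone within a given number of hops. The graph, names and function are hypothetical, and a real graph database would expose this through its own query language.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return {node: distance} for nodes reachable in at most max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]
    return seen


# Hypothetical social graph: who is connected to whom
graph = {
    "ann": ["bob", "cy"],
    "bob": ["ann", "dee"],
    "cy": ["ann"],
    "dee": ["bob", "eve"],
    "eve": ["dee"],
}
print(within_hops(graph, "ann", 2))
```

In a relational schema the same question would need one self-join per hop, which is the "recursive self-joins" pain point the slide mentions.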
38.
More about Column Databases
• Type A – Column-families
  – Non-relational
  – Sparse
  – Examples: HBase, Cassandra, xVelocity (SQL 2012 BISM)
• Type B – Column-stores
  – Relational
  – Dense
  – Example: SQL Server 2012 Columnstore index

Demo - Document Database (MongoDB)
• Use for data that is
  – document-oriented (a collection of JSON documents) with semi-structured data
  – Encodings include XML, YAML, JSON and BSON
  – Binary forms include PDF and Microsoft Office documents (Word, Excel…)
• Examples: MongoDB, CouchDB
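The difference between the two column-oriented shapes can be sketched with plain Python containers. This is only an in-memory analogy under stated assumptions: the metric names and values are hypothetical, and neither structure reflects how HBase or a SQL Server columnstore index stores data on disk.

```python
# Type A, column family: each row key holds only the columns it actually has,
# so rows can be sparse and differ from one another.
column_family = {
    "host1": {"cpu:user": 12.5, "cpu:system": 3.1},
    "host2": {"cpu:user": 40.0, "mem:free": 2048},
}

# Type B, column store: a fixed relational schema stored one dense array
# per column, which makes scanning a single column cheap.
column_store = {
    "host": ["host1", "host2"],
    "cpu_user": [12.5, 40.0],
    "cpu_system": [3.1, 0.4],
}


def columns_present(rows):
    """List every column name that appears in at least one sparse row."""
    return sorted({name for row in rows.values() for name in row})


def average(column):
    """Aggregate one dense column without touching the others."""
    return sum(column) / len(column)


print(columns_present(column_family))
print(average(column_store["cpu_user"]))
```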
39.
Demo MongoDB

Persistent Key / Value Database
• Schema-less
• State - Persistent
• Examples
  – AWS DynamoDB
  – Azure Tables
  – Project Voldemort
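What "schema-less" and "persistent" mean together can be shown with Python's standard-library `shelve` module, used here as a local stand-in for a persistent key/value store. The file name, key and record are hypothetical, and nothing here reflects the APIs of DynamoDB, Azure Tables or Voldemort.

```python
import os
import shelve
import tempfile


def remember(path, key, value):
    """Store any picklable value under a key; no schema is declared."""
    with shelve.open(path) as store:
        store[key] = value


def recall(path, key):
    """Reopen the store from disk and fetch the value, or None if absent."""
    with shelve.open(path) as store:
        return store.get(key)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "profiles")
    remember(path, "user:42", {"name": "Ada", "plan": "free"})
    # The store was closed after the write, so this read comes from disk
    restored = recall(path, "user:42")
    missing = recall(path, "user:99")

print(restored, missing)
```

A volatile store would behave like an ordinary in-memory dictionary instead: the data disappears when the process ends.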
40.
Volatile Key / Value Database
• Schema-less
• State - Volatile
• Examples
  – Redis
  – Memcached

Which type of NoSQL for which type of data?
Type of Data | Type of NoSQL | Example solution
Log files | Wide Column | HBase
Product Catalogs | Key Value on disk | DynamoDB
User profiles | Key Value in memory | Redis
Startups | Document | MongoDB
Social media connections | Graph | Neo4j
LOB w/Transactions | NONE! Use RDBMS | SQL Server
41.
What about the cloud?

Cloud-hosted NoSQL up to 50x CHEAPER
42.
Consumer Storage Buckets
• Dropbox
• Box
• Windows SkyDrive
• Google Drive
• Amazon Cloud Drive
• Apple iCloud

Developer BLOB Storage Buckets
• Amazon – S3 or Glacier
• Google – Cloud Storage
• Microsoft Azure BLOBS
• Others
43.
Cloud-hosted RDBMS
• AWS RDS – SQL Server, MySQL, Oracle
  – Medium cost
  – Solid feature set, e.g. backup, snapshot
  – Use existing tooling
• Google – MySQL
  – Lowest cost
  – Most limited RDBMS functionality
• Microsoft – Windows Azure SQL Database
  – Highest cost
  – Azure VMs w/MySQL
44.
Other cloud data services
Hosting public datasets
• Pay to read
• Earn revenue by offering for read
Cleaning / matching (your) data
• ETL – Microsoft Data Explorer, Google Refine
• Data Quality – Windows Azure Marketplace, InfoChimps, DataMarket.com

Cloud – RDBMS, NoSQL & Hadoop
Service | AWS | Google | Microsoft
Cloud RDBMS | SQL Server, Oracle | MySQL | SQL Azure / MySQL
NoSQL buckets | S3 or Glacier | Cloud Storage | Azure Storage
NoSQL databases | DynamoDB | H/R Datastore on GAE | Azure Tables
Streaming & Machine Learning | Custom EC2 | Prospective Search & Prediction API | StreamInsight, Mahout with Hadoop
Document or Graph | MongoDB on EC2 | Freebase (g) | MongoDB on Windows Azure
Hadoop | Elastic MapReduce using S3 & EC2 | MapR & GCE | Windows Azure HDInsight
Data sets & other | Karmasphere | Translation API, Full-text search | Azure DataMarket
45.
Demo Amazon RDS

Pick your mix and then…
Other
• Use Cloud Data Markets
• Use Cloud ETL Services
RDBMS
• Host locally
• Host in the Cloud
NoSQL
• Host locally
• Host in the Cloud
46.
What about me?

Common DBA Tasks in NoSQL
RDBMS | NoSQL
Import Data | Import Data
Set Up Security | Set Up Security
Perform a Backup | Make a copy of the data
Restore a Database | Move a copy to a location
Create an Index | Create an Index
Join Tables Together | Run MapReduce
Schedule a Job | Schedule a (Cron) Job
Run Database Maintenance | Monitor space and resources used
Send an Email from SQL Server | Set up resource threshold alerts
Search BOL | Interpret Documentation
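The row that maps "Join Tables Together" to "Run MapReduce" deserves a concrete picture. Below is a minimal in-process Python sketch of a reduce-side join: the map step tags each record with its source and emits it under the join key, the shuffle groups by key, and the reduce step pairs customers with their orders. The record layouts and function names are hypothetical, and a real Hadoop job would distribute these phases across a cluster.

```python
from collections import defaultdict


def map_phase(customers, orders):
    """Emit (join_key, tagged_record) pairs from both inputs."""
    for cust_id, name in customers:
        yield cust_id, ("customer", name)
    for order_id, cust_id, amount in orders:
        yield cust_id, ("order", (order_id, amount))


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Inner join: pair every customer record with every matching order."""
    joined = []
    for cust_id in sorted(groups):
        names = [v for tag, v in groups[cust_id] if tag == "customer"]
        orders = [v for tag, v in groups[cust_id] if tag == "order"]
        for name in names:
            for order_id, amount in orders:
                joined.append((cust_id, name, order_id, amount))
    return joined


# Hypothetical sample data
customers = [(1, "Ann"), (2, "Bob")]
orders = [(100, 1, 25.0), (101, 1, 10.0), (102, 3, 5.0)]
result = reduce_phase(shuffle(map_phase(customers, orders)))
print(result)
```

Customer 2 has no orders and order 102 has no matching customer, so neither appears in the inner-join result.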
47.
Making Sense – Asking Questions

Data Scientists…
48.
Comparing…

Karmasphere Studio for AWS
49.
Google BigQuery w/Excel
• Dremel-based service
  – For massive amounts of data
  – BigQuery currently has quota limits
  – SQL-like query language

Demo Google BigQuery
50.
NoSQL To-Do List
Understand CAP & types of NoSQL databases
• Use NoSQL when business needs dictate
• Use the right type of NoSQL for your business problem
Try out NoSQL on the cloud
• Quick and cheap for behavioral data
• Mash up cloud datasets
• Good for specialized use cases, e.g. dev, test, training environments
Learn NoSQL access technologies
• New query languages, e.g. MapReduce, R, Infer.NET
• New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc.

The Changing Data Landscape
• Other Services
• RDBMS
• NoSQL
51.
NoSQL for .NET Developers
• RavenDB
• MongoDB C#/.NET Driver
• MongoDB on Windows Azure
• Couchbase .NET Client Library
• Riak client for .NET
• AWS Toolkit for Visual Studio
• Google cloud APIs (REST-based)