More from Microsoft TechNet - Belgium and Luxembourg

SQL Server 2012 and Big Data

  1. SQL SERVER 2012 AND BIG DATA: Hadoop Connectors for SQL Server
  2. TECHNICALLY – WHAT IS HADOOP • Hadoop consists of two key services: • Data storage using the Hadoop Distributed File System (HDFS) • High-performance parallel data processing using a technique called MapReduce.
  3. HADOOP IS AN ENTIRE ECOSYSTEM • HBase as the database • Hive as the data warehouse • Pig as the query language • All built on top of Hadoop and the MapReduce framework.
  4. HDFS • HDFS is designed to scale seamlessly • That’s its strength! • Scaling horizontally is non-trivial in most cases. • HDFS scales by throwing more hardware at it. • A lot of it! • HDFS is asynchronous • This is what links Hadoop to cloud computing.
  5. DIFFERENCES • SQL Server & Windows 2008 R2’s NTFS? • Data is not stored in the traditional table-column format. • HDFS supports forward-only parsing • Databases built on HDFS don’t guarantee ACID properties • Taking code to the data • SQL Server scales better vertically
  6. UNSTRUCTURED DATA • Doesn’t know or care about column names, column data types, column sizes or even the number of columns. • Data is stored in delimited flat files • You’re on your own with respect to data cleansing • Data input in Hadoop is as simple as loading your data file into HDFS • It’s very close to copying files on an OS.
  7. NO SQL, NO TABLES, NO COLUMNS, NO DATA? • Write code to do MapReduce • You have to write code to get data • The best way to get data: write code that calls the MapReduce framework to slice and dice the stored data • Step 1 is Map and Step 2 is Reduce.
  8. MAP (REDUCE) • Mapping • Pick your selection of keys from each record (records are typically linefeed-separated) • Tell the framework what your key is and what values that key will hold • MR will deal with the actual creation of the map • You control which keys to include and which values to filter out • You end up with a giant hashtable
  9. (MAP) REDUCE • Reducing data: once the map phase is complete, the code moves on to the reduce phase. The reduce phase works on the mapped data and can potentially do all the aggregation and summation activities. • Finally you get a blob of the mapped and reduced data.
  10. JAVA… VS. PIG… • Pig is a querying engine • Has a ‘business-friendly’ syntax • Spits out MapReduce code • The syntax for Pig is called Pig Latin (don’t ask) • Pig Latin is very similar syntactically to LINQ. • Pig converts queries into MapReduce, sends them off to Hadoop, then retrieves the results • Half the performance of hand-written MapReduce • 10 times faster to write
  11. HBASE • HBase is a key-value store on top of HDFS • This is the NoSQL database • Very thin layer over raw HDFS • Data is grouped in a table that has rows of data. • Each row can have multiple ‘Column Families’ • Each ‘Column Family’ contains multiple columns. • Each column name is the key and it has its corresponding column value. • Each row doesn’t need to have the same number of columns
  12. HIVE • Hive is a little closer to RDBMS systems • Is a DWH system on top of HDFS and HBase • Performs join operations between HBase tables • Maintains a meta layer • Data summation, ad-hoc queries and analysis of large data stores in HDFS • High-level language • Hive Query Language looks like SQL but is restricted • No updates or deletes are allowed • Partitioning can be used to update information o Essentially re-writing a chunk of data.
  13. WINDOWS HADOOP - PROJECT ISOTOPE • 2 flavours • Cloud o Azure CTP • On-premises o Integration of the Hadoop file system with Active Directory o Integration of System Center Operations Manager with Hadoop o BI integration • Are not all that interesting in and of themselves, but data and tools are o Sqoop – integration with SQL Server o Flume – access to lots of data
  14. SQOOP • Is a framework that facilitates transfer between relational database management systems (RDBMS) and HDFS. • Uses MapReduce programs to import and export data • Imports and exports are performed in parallel with fault tolerance. • Source / target files used by Sqoop can be: • delimited text files • binary SequenceFiles containing serialized record data.
  15. SQL SERVER – HORTONWORKS – HADOOP • Spin-off from Yahoo • Bridges the technological gaps between Hadoop and Windows Server • CTP of the Hadoop-based distribution for Windows Server (sometime in 2012) • Will work with Microsoft’s business-intelligence tools • including o Excel o PowerPivot o PowerView
  16. HADOOP CONNECTORS • SQL Server versions • Azure • PDW • SQL 2012 • SQL 2008 R2 http://www.microsoft.com/download/en/details.aspx?id=27584
  17. WITH SQL SERVER-HADOOP CONNECTOR, YOU CAN: • Sqoop-based connector • Import • Tables in SQL Server to delimited text files on HDFS • Tables in SQL Server to SequenceFiles on HDFS • Tables in SQL Server to tables in Hive • Results of queries executed on SQL Server to delimited text files on HDFS • Results of queries executed on SQL Server to SequenceFiles on HDFS • Results of queries executed on SQL Server to tables in Hive • Export • Delimited text files on HDFS to SQL Server • SequenceFiles on HDFS to SQL Server • Hive tables to tables in SQL Server
  18. SQL SERVER 2012 ALONGSIDE THE ELEPHANT • PowerView utilizes its own class of apps, if you will, that Microsoft is calling insights. • SQL Server will extend insights to Hadoop data sets • Interesting insights can be • Brought into a SQL Server environment using connectors • Used to drive analysis with BI tools.
  19. WHY USE HADOOP WITH SQL SERVER • Don’t think of big data only as large volumes • Analyze both structured and unstructured datasets • Think about workload, growth, accessibility and even location • Can the amount of data stored every day be reliably written to a traditional HDD? • MapReduce is more complex than T-SQL • Many companies try to avoid writing Java for queries • Front ends are immature relative to the tooling available in the relational database world • It’s not going to replace your database, but your database isn’t likely to replace Hadoop either.
  20. MICROSOFT AND HADOOP • Broader access to Hadoop for: • End users • IT professionals • Developers • Enterprise-ready Hadoop distribution with greater security, performance and ease of management. • Breakthrough insights through the use of familiar tools such as Excel, PowerPivot, SQL Server Analysis Services and Reporting Services.
  21. ENTERPRISE HADOOP • Installation wizard (IsotopeClusterDeployment) • Health check and monitoring pages • Interactive JavaScript console
  22. MICROSOFT ENTERPRISE HADOOP • Machines in the Hadoop cluster must be running Windows Server 2008 or higher • IPv4 networking enabled on all nodes • Deployment does not work on IPv6-only networks. • The ability to create a new user account called “Isotope”. • Will be created on all nodes of the cluster. • Used for running Hadoop daemons and running jobs. • Must be able to copy and install the deployment binaries to each machine • Windows File Sharing services must be enabled on each machine that will be joined to the Hadoop cluster. • .NET Framework 4 installed on all nodes. • Minimum of 10 GB free space on the C drive (JBOD HDFS configuration is supported)
  23. © 2011 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
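
The map and reduce steps outlined in slides 8 and 9 can be illustrated with a small in-memory sketch. This is plain Python, not Hadoop code: the sample records, the field layout (region, product, amount) and the function names `map_phase`, `shuffle` and `reduce_phase` are all assumptions made for illustration. It only shows the idea of picking a key and value per linefeed-separated record, filtering while mapping, and aggregating per key afterwards.

```python
from collections import defaultdict

# Hypothetical input: linefeed-separated, comma-delimited records
# in the form region,product,amount. The last record is deliberately dirty.
RAW = "north,widget,10\nsouth,widget,5\nnorth,gadget,7\nsouth,gadget,bad\n"


def map_phase(text):
    """Emit one (key, value) pair per record, skipping records that do not parse."""
    pairs = []
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) != 3:
            continue  # wrong number of fields: filtered out
        region, _product, amount = fields
        try:
            pairs.append((region, int(amount)))
        except ValueError:
            continue  # non-numeric amount: filtered out
    return pairs


def shuffle(pairs):
    """Group values by key, giving the 'giant hashtable' of mapped data."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Aggregate the values held under each key (here: a sum)."""
    return {key: sum(values) for key, values in grouped.items()}


totals = reduce_phase(shuffle(map_phase(RAW)))
print(totals)  # {'north': 17, 'south': 5}
```

In a real cluster the framework, not the caller, would run the grouping step and distribute the map and reduce work across nodes; the sketch keeps everything in one process so the data flow stays visible.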

Editor's notes

  • 1. Data is not stored in the traditional table-column format. At best some of the database layers mimic this, but deep in the bowels of HDFS there are no tables, no primary keys, no indexes. Everything is a flat file with predetermined delimiters. HDFS is optimized to recognize the <Key, Value> mode of storage. Everything maps down to <Key, Value> pairs. 2. HDFS supports forward-only parsing, so you are either reading ahead or appending to the end. There is no concept of ‘Update’ or ‘Insert’. 3. Databases built on HDFS don’t guarantee ACID properties, especially ‘Consistency’. They offer what is called ‘Eventual Consistency’, meaning data will be saved eventually, but because of the highly asynchronous nature of the file system you are not guaranteed at what point it will finish. So HDFS-based systems are NOT ideal for OLTP systems. RDBMSs still rock there. 4. Taking code to the data. In traditional systems you fire a query to get data and then write code to manipulate it. In MapReduce, you write code, send it to Hadoop’s data store and get back the manipulated data. Essentially you are sending code to the data. 5. Traditional databases like SQL Server scale better vertically, so more cores, more memory and faster cores are the way to scale. Hadoop, however, scales horizontally by design. Keep throwing hardware at it and it will scale.
  • Mapping Data: If it is plain delimited text data, you have the freedom to pick your selection of keys and values from the record (remember records are typically linefeed-separated) and tell the framework what your key is and what values that key will hold. MR will deal with the actual creation of the map. While the map is being created you can control which keys to include and which values to filter out. In the end you end up with a giant hashtable of filtered key-value pairs. Now what?
  • Well, if you are that scared of Java, then you have Pig. No, I am not calling names here. Pig is a querying engine that has a more ‘business-friendly’ syntax but spits out MapReduce code in the backend and does all the dirty work for you. The syntax for Pig is called, of course, Pig Latin. When you write queries in Pig Latin, Pig converts them into MapReduce and sends them off to Hadoop, then retrieves the results and hands them back to you. Analysis shows you get about half the performance of raw, optimal, hand-written MapReduce Java code, but the same code takes more than 10 times as long to write as a Pig query. If you are in the mood for a start-up idea, generating optimal MapReduce code from Pig Latin is a topic to consider… For those in the .NET world, Pig Latin is very similar syntactically to LINQ.
  • HBase is a key-value store that sits on top of HDFS. It is a NoSQL database. It has a very thin veneer over raw HDFS, in that it mandates that data is grouped in a table that has rows of data. Each row can have multiple ‘Column Families’ and each ‘Column Family’ can contain multiple columns. Each column name is the key and it has its corresponding column value. So a column of data can be represented as row[family][column] = value. Each row need not have the same number of columns. Think of each row as a horizontal linked list that links to a column family, and then each column family links to multiple columns as <Key, Value> pairs: row1->family1->col A = val A->family2->col B = val B, and so on.
  • Hive is a little closer to traditional RDBMS systems. In fact it is a data warehousing system that sits on top of HDFS but maintains a meta layer that helps with data summation, ad-hoc queries and analysis of large data stores in HDFS. Hive supports a high-level language called Hive Query Language that looks like SQL but is restricted in a few ways, for example no updates or deletes are allowed. However, Hive has a concept of partitioning that can be used to update information, which essentially means re-writing a chunk of data whose granularity depends on the schema design. Hive can also sit on top of HBase and perform join operations between HBase tables.
  • Isotope is more than the distributions that the Softies are building with Hortonworks. Isotope also refers to the whole “tool chain” of supporting big-data analytics offerings that Microsoft is packaging up around the distributions. Microsoft’s big-picture concept is that Isotope is what will give all kinds of users, from technical to “ordinary” productivity workers, access from inside data-analysis tools they know (like Microsoft’s own SQL Server Analysis Services, PowerPivot and Excel on their PCs) to data stored in Windows Servers and/or Windows Azure. (The Windows Azure Marketplace fits in here, as this is the place where third-party providers can publish free or paid collections of data which users will be able to download or buy.) To accelerate its adoption in the enterprise, Microsoft will make Hadoop enterprise-ready through:
      o Active Directory Integration: Providing enterprise-class security through integration of Hadoop with Active Directory
      o High Performance: Boosting Hadoop performance to offer consistently high data throughput
      o System Center Integration: Simplifying management of the Hadoop infrastructure through integration with Microsoft’s management tools such as System Center
      o BI Integration: Enabling integration of relational and Hadoop data into enterprise BI solutions with Hadoop connectors
      o Flexibility and Choice, with deployment options for Windows Server and Windows Azure, which offer customers:
          - Freedom to choose: More control, as they can choose which data to keep in-house instead of in the cloud.
          - Lower TCO: Cost savings, as fewer resources are required to run their Hadoop deployment in the cloud.
          - Elasticity to meet demand: Elasticity reduces costs, since more nodes can be added to the Windows Azure deployment for more demanding workloads. In addition, the Azure deployment of Hadoop can be used to extend the on-premises solution in periods of high demand.
          - Increased Performance: Bringing computing closer to the data. Our solution enables customers to process data closer to where the data is born, whether on-premises or in the cloud.
    We do this while maintaining compatibility with existing Hadoop tools such as Pig, Hive, and Java. Our goal is to ensure that applications built on Apache Hadoop can be easily migrated to our distribution to run on Windows Azure or Windows Server.
  • For developers, Microsoft is investing to make JavaScript a first-class language within Big Data by making it possible to write high-performance Map/Reduce jobs using JavaScript. In addition, our JavaScript console will allow users to write JavaScript Map/Reduce jobs, Pig Latin, and Hive queries from the browser to execute their Hadoop jobs.
      o Analyze Hadoop data with familiar tools such as Excel, thanks to a Hive Add-in for Excel
      o Reduce time to solution through integration of Hive and Microsoft BI tools such as PowerPivot and Power View
      o Build corporate BI solutions that include Hadoop data, through integration of Hive and leading BI tools such as SQL Server Analysis Services and Reporting Services
      o Customers can use this connector (on an already deployed Hadoop cluster) to analyze unstructured or semi-structured data from various sources and then load the processed data into PDW
      o Efficiently transfer terabytes of data between Hadoop and PDW
      o Enables users to get the best of both worlds: Hadoop for processing large volumes of unstructured data, and PDW for analyzing structured data with easy integration to BI tools
      o Use of MapReduce and the PDW Bulk Load/Extract tool for fast import/export
  • Sqoop is an open source connectivity framework that facilitates transfer between multiple Relational Database Management Systems (RDBMS) and HDFS. Sqoop uses MapReduce programs to import and export data; the imports and exports are performed in parallel with fault tolerance. The Source / Target files being used by Sqoop can be delimited text files (for example, with commas or tabs separating each field), or binary SequenceFiles containing serialized record data. Please refer to section 7.2.7 in Sqoop User Guide for more details on supported file types. For information on SequenceFile format, please refer to Hadoop API page.
  • Broader access to Hadoop through simplified deployment and programmability. Microsoft has simplified setup and deployment of Hadoop, making it possible to set up and configure Hadoop on Windows Azure in a few hours instead of days. Since the service is hosted on Windows Azure, customers only download a package that includes the Hive Add-in and Hive ODBC Driver. In addition, Microsoft has introduced new JavaScript libraries to make JavaScript a first-class programming language in Hadoop. Through these libraries JavaScript programmers can easily write MapReduce programs in JavaScript and run these jobs from simple web browsers. These improvements lower the barrier to entry by enabling customers to easily deploy and explore Hadoop on Windows. Breakthrough insights through integration with Microsoft Excel and BI tools. This preview ships with a new Hive Add-in for Excel that enables users to interact with data in Hadoop from Excel. With the Hive Add-in customers can issue Hive queries to pull and analyze unstructured data from Hadoop in the familiar Excel. Second, the preview includes a Hive ODBC Driver that integrates Hadoop with Microsoft BI tools. This driver enables customers to integrate and analyze unstructured data from Hadoop using award-winning Microsoft BI tools such as PowerPivot and PowerView. As a result customers can gain insight into all their data, including unstructured data stored in Hadoop. Elasticity, thanks to Windows Azure. This preview of the Hadoop-based service runs on Windows Azure, offering an elastic and scalable platform for distributed storage and compute.
  • Companies do not have to be at Google scale to have data issues. Scalability issues occur with less than a terabyte of data. If a company works with relational databases and SQL, it can drown in complex data transformations and calculations that do not fit naturally into sequences of set operations. In that sense, the “big data” mantra is misguided at times… The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data. The more important topics are the specifics of the storage and processing infrastructure and what approaches best suit each problem. Attack unstructured and semi-structured datasets without the overhead of an ETL step to insert them into a traditional relational database. From CSV to XML, we can load in a single step and begin querying.
  • through easy installation and configuration and simplified programming with JavaScript. The CTP of Microsoft's Hadoop-based Service for Windows Azure is now available. Complete the online form with details of your Big Data scenario to download the preview. Microsoft will issue a code that will be used by the selected customers to access the Hadoop-based Service.
  • Gain new insights from your data. Have you ever had trouble finding data you needed? Or combining data from different, incompatible sources? How about sharing the results with others in a web-friendly way? If so, we want you to try the Microsoft Codename “Data Explorer” Cloud service. With "Data Explorer" you can: Identify the data you care about from the sources you work with (e.g. Excel spreadsheets, files, SQL Server databases). Discover relevant data and services via automatic recommendations from the Windows Azure Marketplace. Enrich your data by combining it and visualizing the results. Collaborate with your colleagues to refine the data. Publish the results to share them with others or power solutions. In short, we help you harness the richness of data on the Web to generate new insights.
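
The HBase layout described in the notes, row[family][column] = value, can be modelled with nested dictionaries. This is a minimal sketch in plain Python and does not use any HBase client library; the helper names `new_table`, `put`, `get` and `column_count` and the sample rows are hypothetical. It only illustrates that each row holds column families, each family holds <Key, Value> columns, and rows need not share the same columns.

```python
from collections import defaultdict


def new_table():
    """Create an empty table: row key -> column family -> column name -> value."""
    return defaultdict(lambda: defaultdict(dict))


def put(table, row, family, column, value):
    """Store a single cell: row[family][column] = value."""
    table[row][family][column] = value


def get(table, row, family, column, default=None):
    """Read a single cell without creating missing rows or families."""
    return table.get(row, {}).get(family, {}).get(column, default)


def column_count(table, row):
    """Count the columns a row actually holds, across all of its families."""
    return sum(len(columns) for columns in table.get(row, {}).values())


# Hypothetical sample data mirroring the notation used in the notes.
users = new_table()
put(users, "row1", "family1", "colA", "valA")
put(users, "row1", "family2", "colB", "valB")
put(users, "row2", "family1", "colA", "valC")  # row2 has no family2 at all

print(get(users, "row1", "family2", "colB"))  # valB
print(get(users, "row2", "family2", "colB"))  # None
print(column_count(users, "row1"), column_count(users, "row2"))  # 2 1
```

The nested-dictionary shape makes the sparse rows visible: a missing column costs nothing to store, which is the property the notes describe when they say each row need not have the same number of columns.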