The second part of our Microsoft Big Data session looks at how Big Data can be made accessible through "classic" SQL, and how the new PolyBase engine makes it easy to join unstructured Hadoop data with relational data warehouse data.
In the Hadoop world, SQL access is provided by the Hive component.
Through the Microsoft Hive ODBC connector, the usual BI tools, such as PowerPivot, can use this access directly.
The PolyBase engine, finally, will become part of SQL Server 2012 Parallel Data Warehouse and allows transparent SQL access, no matter where the data resides.
4. Hadoop & Business Intelligence
Hadoop is great for storing & processing *large* amounts of data
(but) Map/Reduce jobs are kind of low level
(and most) BI tools rely on relational or multidimensional data sources
and declarative languages like SQL | MDX | DAX
?
5. The Hive Project
Hive was started at Facebook (2008)
Goal: empower business users to query Hadoop clusters with standard tools & SQL
Famous paper at the VLDB conference 2009:
by 2009, 700 TB of data already "lived" in Hive at Facebook: 5,000 queries a day from over 100 users
Hive is a "Data Warehouse" for Hadoop!
(a system for managing data structures built on top of Hadoop)
http://www.vldb.org/pvldb/2/vldb09-938.pdf
6. Hive architecture
Query language: HiveQL (subset of SQL)
Uses Map/Reduce for execution
Rule-based optimizer
[Architecture diagram: Command Line Interface | Web Interface | Thrift Server (JDBC, ODBC) → Driver (Compiler, Optimizer, Executor) + Metastore]
Performance is an issue: the Hortonworks Stinger initiative aims for "human-time use cases"
7. Hive concepts
Well known: Databases | Tables | Rows & Columns
Table = (file or) directory
e.g. twitter_feeds -> /user/hive/warehouse/twitter_feeds
Storage: ORC (Optimized Row Columnar), Textfile, RCFile
(Record Columnar File), etc.
Primitive Types: integer, float, string, date, boolean
…plus arrays, maps, user defined types
Partitions = subdirectories
Indexes = data subsets or bitmaps
HiveQL: SELECT…FROM…WHERE (incl.
Joins, Aggregates, Union All, Subqueries)
Can embed M/R scripts
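The last point above — embedding Map/Reduce scripts — can be sketched with HiveQL's TRANSFORM clause; the script name and output columns here are illustrative assumptions, not from the deck:

```sql
-- Ship a custom script to the cluster, then stream rows through it.
-- 'parse_time.py' and the output columns are hypothetical.
ADD FILE parse_time.py;

SELECT TRANSFORM (logdate, time_taken)
       USING 'python parse_time.py'
       AS (logdate string, slow_flag int)
FROM logdata;
```

Hive turns the script invocation into a Map/Reduce job, so custom logic runs next to the data without leaving SQL.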
8. HiveQL Example
CREATE TABLE logdata(
  logdate string,
  logtime string,
  …
  time_taken int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

LOAD DATA INPATH '/w3c/input/data' OVERWRITE INTO TABLE logdata;

SELECT logdate, logtime, time_taken FROM logdata LIMIT 200;
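To illustrate the "partitions = subdirectories" point from slide 7, a partitioned variant of the table above might look like this — the choice of partition column is an assumption for illustration:

```sql
-- Each logdate value gets its own HDFS subdirectory, e.g.
-- /user/hive/warehouse/logdata_part/logdate=2009-08-24/
CREATE TABLE logdata_part(
  logtime string,
  time_taken int)
PARTITIONED BY (logdate string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

-- A predicate on the partition column prunes whole subdirectories:
SELECT logtime, time_taken
FROM logdata_part
WHERE logdate = '2009-08-24';
```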
20. PDW V1 (Reference) vs. PDW V2
V2 full rack: 10x faster & 50% lower capital cost, 70% more disk I/O bandwidth
V1 basic full rack — estimated total HW list price: $1MM
• 160 cores on 10 compute nodes
• 1.28 TB of RAM on compute
• Up to 30 TB of tempdb
• Up to 150 TB of user data
• Control node, mgmt. node, landing zone, backup node
• Infiniband & Ethernet, Fiber Channel to storage
V2 — estimated total HW list price: $500K
• 128 cores on 8 compute nodes
• 2 TB of RAM on compute
• Up to 168 TB of tempdb
• Up to 1 PB of user data
• Control node; combined compute & storage nodes
• Infiniband & Ethernet
21. SQL Server 2012 Parallel Data Warehouse
Development goals:
Up- & downscale: 2-56 compute nodes, 1-6 racks
Unique standardized nodes: 256 GB
Compute & storage nodes are VMs
Simple management, hardware abstraction, different workloads
ColumnStore v2 storage: high compression, updatable for incremental loads
Hardware architecture: FDR Infiniband, direct-attached SAS
22. Start small & grow
Dynamic scale-up, no downtime:
Start small with a warehouse of a few terabytes
Add capacity up to 5 PB (enterprise warehouse)
Increments of 2-3 compute nodes plus storage
23. Hardware Architecture (HP configuration)
[Rack diagrams: redundant Infiniband & Ethernet per rack; control node and passive failover node; compute nodes 1-24 paired two per JBOD (JBOD 1-12); remaining rack space for customer use]
Base unit (6U): redundant Infiniband, redundant Ethernet, mgmt & control (active), rack failover node (passive)
Base/scale unit (7U): 2 HP 1U servers (16 cores each, 32 total), 5U JBOD with 1 TB drives, user data capacity 75 TB
Extension base unit (5U): redundant Infiniband, redundant Ethernet, rack failover node (passive)
Customer space (8-9U): ETL servers, backup servers, passive unit (additional spares)
Raw capacity by configuration: ¼ rack 15 TB | ½ rack 30 TB | full rack 60 TB | 1¼ racks 75.5 TB | 1½ racks 90.6 TB | 2 racks 120.8 TB | 3 racks 181.2 TB
Appliance limits:
• 2-56 compute nodes
• 1-7 racks
• 1, 2, or 3 TB drives
• 15.1-1268.4 TB raw
• 53-6342 TB user data
• Up to 7 spare nodes available across the entire appliance
24. Agility Due to Virtualization
VMs for different workloads (e.g. HDInsight zone)
Storage Spaces manage the physical disks on the JBOD(s): 33 logical mirrored drives (66 drives & 4 hot spares)
Clustered Shared Volumes (CSV) allow all nodes to access the LUNs on the JBOD
One cluster across the whole appliance; VMs are automatically migrated on failure
[Diagram: hosts 0-3 running CTL, MAD, FAB, AD, VMM and compute VMs, all attached to a JBOD]
* 3 nodes per JBOD in the Dell configuration
25. xVelocity Columnstore as Primary Storage
[Diagram: table columns C1-C6 stored as independent column segments T.C1-T.C4]
Better IO & caching:
columns stored independently
early segment elimination
aggressive read-ahead
Memory optimization:
new Memory Broker
segments are loaded when needed …and stay in memory as long as possible
Batch mode:
max. parallelism
ca. 1,000 values per kernel
CPU time is reduced by a factor of 7 to 40
[Diagram: SELECT Region, SUM(Sales) … over column vectors; bitmap of qualified rows feeds a batch object]
27. Compression Rates in the Demo
[Bar chart, sizes in GB:]
Raw data: 82
Page compression: 34.2
Backup compression: 22.9
Columnstore index (disk): 11.2
Columnstore index (memory): 4
Compression ratio for *this* data: ca. 20
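As a hedged sketch, the on-disk columnstore size shown in the chart can be checked with the catalog views that ship with SQL Server 2012; the join over `hobt_id` follows the 2012-era catalog layout:

```sql
-- Approximate on-disk size of columnstore indexes, summed over all segments.
SELECT OBJECT_NAME(p.object_id)              AS table_name,
       SUM(s.on_disk_size) / 1024.0 / 1024.0 AS size_mb
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p
  ON s.hobt_id = p.hobt_id
GROUP BY p.object_id;
```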
28. Columnstore: The Next Generation
Columnstore becomes the primary data structure (clustered index)
No need for a base table
Allows updates & deletes (temporary row store)
Easy data management
Improvements:
Supports all (reasonable) data types
Supports more query operators
Statistics on partitioned tables
PDW v2 & SQL Server 2014
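In SQL Server 2014 syntax, the "columnstore becomes the primary data structure" point looks like this — the table name is hypothetical:

```sql
-- The clustered columnstore index *is* the table;
-- no separate row-store base table remains.
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales
    ON dbo.FactSales;

-- Updates and deletes now work: changed rows land in a temporary
-- row store (delta store) until they are compressed into segments.
UPDATE dbo.FactSales
SET Sales = Sales * 1.1
WHERE Region = 'EMEA';
```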
29. Microsoft BI Stack Connectivity
Driver matrix — targeted drivers are SNAC11 (OLE DB, ODBC), SNAC10 (OLE DB, ODBC), and .NET (SqlClient), each in 32-bit and 64-bit; X = supported by the PDW tool, n/a = not applicable:
SSRS/Reporting Services
SSRS - SS2012 (report builder and SSDT) X X X X X X
SSRS - SS2008 R2 (report builder) X n/a X n/a X n/a
SSAS/Analysis Services
SSAS – SS2012 X X X X X X X X X X
SSAS – SS2008 R2 X X X X X X
Linked Server (DQ)
DQ – SS2008 X X X X
SSIS/Integration Services
SSIS - SS2012 X X X X X X X X X X
SSIS - SS2008 R2 X X X X X X
PowerPivot for Excel n/a n/a X X n/a n/a X X
Power View
SS2012 w/ and w/o direct query X X X X X X
MS BI Direct Query
Excel X X X X n/a n/a
Direct Query n/a n/a X X n/a n/a
Access n/a n/a X X n/a n/a
Master Data Services
SS2012 X X X X X X
SS2008 X X X X
Quality Services
SS2012 X X X X X X
SS2008 X X X X
30. Monitoring
Built-in monitoring via GUI or Dynamic Management Views (DMVs)
System Center Management Packs for PDW
31. Simple Resource Management
Pre-built resource classes in PDW
Resource class = PDW concurrency slots in use | memory utilization | priority
The DBA controls how requests are mapped to resource classes.
PDW honors the resource class at run time.
Example classes:
• Concurrency slots: 1 | Memory: V1 HW ~200 MB, V2 HW ~400 MB | Priority: Medium
• Concurrency slots: 3 | Memory: V1 HW ~600 MB, V2 HW ~1.2 GB | Priority: Medium
• Concurrency slots: 7 | Memory: V1 HW ~1.4 GB, V2 HW ~2.8 GB | Priority: High
• Concurrency slots: 21 | Memory: V1 HW ~4.2 GB, V2 HW ~8.4 GB | Priority: High
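As a hedged sketch of how the DBA maps requests to resource classes: in PDW the pre-built classes are exposed as database roles (e.g. mediumrc, largerc), and a login is assigned via standard role membership — the login name here is hypothetical:

```sql
-- Move a load user to a larger pre-built resource class; its requests
-- then run with that class's concurrency slots, memory, and priority.
EXEC sp_droprolemember 'mediumrc', 'LoadUser';
EXEC sp_addrolemember  'largerc',  'LoadUser';
```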
32. Improve T-SQL Parity
T-SQL additions to increase compatibility with:
SQL Server Data Tools
Microsoft BI tools
3rd-party tools, like Tableau
Dedicated PDW tools are deprecated
Catalog SP examples: sp_tables_rowset;2, sp_catalogs_rowset, sp_executesql
Built-in function examples: db_id, db_name, object_id
General T-SQL improvement examples: cross/outer apply, sp_prepare, sp_execute
Configuration functions: @@LANGUAGE, @@SPID
SET options: SET ROWCOUNT, SET FMTONLY
34. Hadoop / Big Data Integration: Microsoft
T-SQL query engine for RDBMS & Hadoop
Cost-based optimizer decides on:
moving HDFS data into RDBMS storage, or
executing operators as Map/Reduce jobs
HDFS bridge for parallelized data transport to & from the HDFS data nodes
Regular T-SQL results
PDW V2
35. External Tables (Hadoop Integration)
External tables are mapped to HDFS files
Fields in the file are defined as columns in the PDW external table
File characteristics are also provided during definition
This works for HDInsight, Hortonworks HDP & Cloudera

CREATE EXTERNAL TABLE ClickEvent
(
  url varchar(50),
  event_date date,
  user_IP varchar(50)
)
WITH
  (LOCATION = 'hdfs://MyHadoop:5000/clickstream/click.txt',
   FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));
37. PolyBase: A Really Simple Query Example
Here the external data movement operation is ExternalRoundRobinMove
Parallel HDFS readers run on every data node (e.g. 10 nodes with 8 threads each)
39. Single Query: Structured and Unstructured
Query and join Hadoop tables with relational tables
Use standard SQL language
PolyBase: a breakthrough in data processing:
Existing SQL skillset
No IT intervention
Save time and costs
Analyze all data types
[Diagram: one SQL query spanning a database on SQL Server 2012 PDW and HDFS (Hadoop), powered by PolyBase]
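A sketch of such a single query, joining the ClickEvent external table from slide 35 with a relational PDW table — dbo.Customer and its columns are illustrative assumptions, not from the deck:

```sql
-- Join HDFS click data (external table) with a relational customer table
-- in one standard T-SQL statement; PolyBase handles the data movement.
SELECT c.CustomerName,
       COUNT(*) AS clicks
FROM dbo.Customer AS c
JOIN ClickEvent  AS e
  ON e.user_IP = c.ip_address
GROUP BY c.CustomerName;
```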
41. Resources
SQL Server CAT-Blog
http://sqlcat.com
bwin Case Study
http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000001470
Microsoft Big Data Site
http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx
Introduction to Hadoop on Windows Azure
http://channel9.msdn.com/Events/windowsazure/learn/Introduction-to-Hadoop-on-Windows-Azure
SQL Server Team Blog
http://blogs.technet.com/b/dataplatforminsider
Microsoft YouTube Big Data Channel
http://www.youtube.com/playlist?list=PLD471EE01A293CC34
TechEd Sessions
http://channel9.msdn.com/Events/TechEd
Microsoft Connect (Product Feedback)
http://connect.microsoft.com