The second part of our Microsoft Big Data session looks at how Big Data can be made accessible through "classic" SQL, and how the new PolyBase engine makes it easy to join unstructured Hadoop data with relational data warehouse data.
In the Hadoop world, SQL access is provided by the Hive component.
Through the Microsoft Hive ODBC connector, the usual BI tools, such as PowerPivot, can use this access directly.
The PolyBase engine, finally, will become part of SQL Server 2012 Parallel Data Warehouse and allows transparent SQL access, no matter where the data resides.
4. Hadoop & Business Intelligence
Hadoop is great for storing & processing *large* amounts of data
(but) Map/Reduce jobs are kind of low level
(and most) BI tools rely on relational or multidimensional data sources
and declarative languages like SQL | MDX | DAX
?
5. The Hive Project
Hive was started at Facebook (2008)
Goal: empower business users to query Hadoop clusters with standard tools & SQL
Famous paper at the VLDB conference 2009:
by 2009, 700 TB of data already "lived" in Hive at Facebook: 5,000 queries a day from over 100 users
Hive is a "Data Warehouse" for Hadoop!
(a system for managing data structures built on top of Hadoop)
http://www.vldb.org/pvldb/2/vldb09-938.pdf
6. Hive architecture
Query language: HiveQL (subset of SQL)
Uses Map/Reduce for execution
Rule-based optimizer
[Architecture diagram: Command Line Interface | Web Interface | Thrift Server (JDBC, ODBC) → Driver (Compiler, Optimizer, Executor) + Metastore]
Performance is an issue: the Hortonworks Stinger initiative aims for "human-time use cases"
7. Hive concepts
Well known: Databases | Tables | Rows & Columns
Table = (file or) directory
e.g. twitter_feeds -> /user/hive/warehouse/twitter_feeds
Storage: ORC (Optimized Row Columnar), Textfile, RCFile
(Record Columnar File), etc.
Primitive Types: integer, float, string, date, boolean
…plus arrays, maps, user defined types
Partitions = subdirectories
Indexes = data subsets or bitmaps
HiveQL: SELECT…FROM…WHERE (incl.
Joins, Aggregates, Union All, Subqueries)
Can embed M/R scripts
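The last point above — embedding Map/Reduce scripts — can be sketched with HiveQL's TRANSFORM clause; the script name and output columns here are illustrative assumptions, not from the deck:

```sql
-- Ship a custom script to the cluster, then stream rows through it.
-- 'parse_time.py' and the output columns are hypothetical.
ADD FILE parse_time.py;

SELECT TRANSFORM (logdate, time_taken)
       USING 'python parse_time.py'
       AS (logdate string, slow_flag int)
FROM logdata;
```

Hive turns the script invocation into a Map/Reduce job, so custom logic runs next to the data without leaving SQL.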
8. HiveQL Example
CREATE TABLE logdata(
  logdate string,
  logtime string,
  …
  time_taken int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

LOAD DATA INPATH '/w3c/input/data' OVERWRITE INTO TABLE logdata;

SELECT logdate, logtime, time_taken FROM logdata LIMIT 200;
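To illustrate the "partitions = subdirectories" point from slide 7, a partitioned variant of the table above might look like this — the choice of partition column is an assumption for illustration:

```sql
-- Each logdate value gets its own HDFS subdirectory, e.g.
-- /user/hive/warehouse/logdata_part/logdate=2009-08-24/
CREATE TABLE logdata_part(
  logtime string,
  time_taken int)
PARTITIONED BY (logdate string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

-- A predicate on the partition column prunes whole subdirectories:
SELECT logtime, time_taken
FROM logdata_part
WHERE logdate = '2009-08-24';
```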
20. PDW V1 (Reference) vs. PDW V2
V2 full rack: 10x faster & 50% lower capital cost, 70% more disk I/O bandwidth
V1 basic full rack — estimated total HW list price: $1MM
• 160 cores on 10 compute nodes
• 1.28 TB of RAM on compute
• Up to 30 TB of tempdb
• Up to 150 TB of user data
• Control node, mgmt. node, landing zone, backup node
• Infiniband & Ethernet, Fiber Channel to storage
V2 — estimated total HW list price: $500K
• 128 cores on 8 compute nodes
• 2 TB of RAM on compute
• Up to 168 TB of tempdb
• Up to 1 PB of user data
• Control node; combined compute & storage nodes
• Infiniband & Ethernet
21. SQL Server 2012 Parallel Data Warehouse
Development goals:
Up- & downscale: 2-56 compute nodes, 1-6 racks
Unique standardized nodes: 256 GB
Compute & storage nodes are VMs
Simple management, hardware abstraction, different workloads
ColumnStore v2 storage: high compression, updatable for incremental loads
Hardware architecture: FDR Infiniband, direct-attached SAS
22. Start small & grow
Dynamic scale-up, no downtime:
Start small with a warehouse of a few terabytes
Add capacity up to 5 PB (enterprise warehouse)
Increments of 2-3 compute nodes plus storage
23. Hardware Architecture (HP configuration)
[Rack diagrams: redundant Infiniband & Ethernet per rack; control node and passive failover node; compute nodes 1-24 paired two per JBOD (JBOD 1-12); remaining rack space for customer use]
Base unit (6U): redundant Infiniband, redundant Ethernet, mgmt & control (active), rack failover node (passive)
Base/scale unit (7U): 2 HP 1U servers (16 cores each, 32 total), 5U JBOD with 1 TB drives, user data capacity 75 TB
Extension base unit (5U): redundant Infiniband, redundant Ethernet, rack failover node (passive)
Customer space (8-9U): ETL servers, backup servers, passive unit (additional spares)
Raw capacity by configuration: ¼ rack 15 TB | ½ rack 30 TB | full rack 60 TB | 1¼ racks 75.5 TB | 1½ racks 90.6 TB | 2 racks 120.8 TB | 3 racks 181.2 TB
Appliance limits:
• 2-56 compute nodes
• 1-7 racks
• 1, 2, or 3 TB drives
• 15.1-1268.4 TB raw
• 53-6342 TB user data
• Up to 7 spare nodes available across the entire appliance
24. Agility Due to Virtualization
VMs for different workloads (e.g. HDInsight zone)
Storage Spaces manage the physical disks on the JBOD(s): 33 logical mirrored drives (66 drives & 4 hot spares)
Clustered Shared Volumes (CSV) allow all nodes to access the LUNs on the JBOD
One cluster across the whole appliance; VMs are automatically migrated on failure
[Diagram: hosts 0-3 running CTL, MAD, FAB, AD, VMM and compute VMs, all attached to a JBOD]
* 3 nodes per JBOD in the Dell configuration
25. xVelocity Columnstore as Primary Storage
[Diagram: table columns C1-C6 stored as independent column segments T.C1-T.C4]
Better IO & caching:
columns stored independently
early segment elimination
aggressive read-ahead
Memory optimization:
new Memory Broker
segments are loaded when needed …and stay in memory as long as possible
Batch mode:
max. parallelism
ca. 1,000 values per kernel
CPU time is reduced by a factor of 7 to 40
[Diagram: SELECT Region, SUM(Sales) … over column vectors; bitmap of qualified rows feeds a batch object]
27. Compression Rates in the Demo
[Bar chart, sizes in GB:]
Raw data: 82
Page compression: 34.2
Backup compression: 22.9
Columnstore index (disk): 11.2
Columnstore index (memory): 4
Compression ratio for *this* data: ca. 20
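As a hedged sketch, the on-disk columnstore size shown in the chart can be checked with the catalog views that ship with SQL Server 2012; the join over `hobt_id` follows the 2012-era catalog layout:

```sql
-- Approximate on-disk size of columnstore indexes, summed over all segments.
SELECT OBJECT_NAME(p.object_id)              AS table_name,
       SUM(s.on_disk_size) / 1024.0 / 1024.0 AS size_mb
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p
  ON s.hobt_id = p.hobt_id
GROUP BY p.object_id;
```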
28. Columnstore: The Next Generation
Columnstore becomes the primary data structure (clustered index)
No need for a base table
Allows updates & deletes (temporary row store)
Easy data management
Improvements:
Supports all (reasonable) data types
Supports more query operators
Statistics on partitioned tables
PDW v2 & SQL Server 2014
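In SQL Server 2014 syntax, the "columnstore becomes the primary data structure" point looks like this — the table name is hypothetical:

```sql
-- The clustered columnstore index *is* the table;
-- no separate row-store base table remains.
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales
    ON dbo.FactSales;

-- Updates and deletes now work: changed rows land in a temporary
-- row store (delta store) until they are compressed into segments.
UPDATE dbo.FactSales
SET Sales = Sales * 1.1
WHERE Region = 'EMEA';
```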
29. Microsoft BI Stack Connectivity
Driver matrix — targeted drivers are SNAC11 (OLE DB, ODBC), SNAC10 (OLE DB, ODBC), and .NET (SqlClient), each in 32-bit and 64-bit; X = supported by the PDW tool, n/a = not applicable:
SSRS/Reporting Services
SSRS - SS2012 (report builder and SSDT) X X X X X X
SSRS - SS2008 R2 (report builder) X n/a X n/a X n/a
SSAS/Analysis Services
SSAS – SS2012 X X X X X X X X X X
SSAS – SS2008 R2 X X X X X X
Linked Server (DQ)
DQ – SS2008 X X X X
SSIS/Integration Services
SSIS - SS2012 X X X X X X X X X X
SSIS - SS2008 R2 X X X X X X
PowerPivot for Excel n/a n/a X X n/a n/a X X
Power View
SS2012 w/ and w/o direct query X X X X X X
MS BI Direct Query
Excel X X X X n/a n/a
Direct Query n/a n/a X X n/a n/a
Access n/a n/a X X n/a n/a
Master Data Services
SS2012 X X X X X X
SS2008 X X X X
Quality Services
SS2012 X X X X X X
SS2008 X X X X
30. Monitoring
Built-in monitoring via GUI or Dynamic Management Views (DMVs)
System Center Management Packs for PDW
31. Simple Resource Management
Pre-built resource classes in PDW
Resource class = PDW concurrency slots in use | memory utilization | priority
The DBA controls how requests are mapped to resource classes.
PDW honors the resource class at run time.
Example classes:
• Concurrency slots: 1 | Memory: V1 HW ~200 MB, V2 HW ~400 MB | Priority: Medium
• Concurrency slots: 3 | Memory: V1 HW ~600 MB, V2 HW ~1.2 GB | Priority: Medium
• Concurrency slots: 7 | Memory: V1 HW ~1.4 GB, V2 HW ~2.8 GB | Priority: High
• Concurrency slots: 21 | Memory: V1 HW ~4.2 GB, V2 HW ~8.4 GB | Priority: High
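As a hedged sketch of how the DBA maps requests to resource classes: in PDW the pre-built classes are exposed as database roles (e.g. mediumrc, largerc), and a login is assigned via standard role membership — the login name here is hypothetical:

```sql
-- Move a load user to a larger pre-built resource class; its requests
-- then run with that class's concurrency slots, memory, and priority.
EXEC sp_droprolemember 'mediumrc', 'LoadUser';
EXEC sp_addrolemember  'largerc',  'LoadUser';
```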
32. Improve T-SQL Parity
T-SQL additions to increase compatibility with:
SQL Server Data Tools
Microsoft BI tools
3rd-party tools, like Tableau
Dedicated PDW tools are deprecated
Catalog SP examples: sp_tables_rowset;2, sp_catalogs_rowset, sp_executesql
Built-in function examples: db_id, db_name, object_id
General T-SQL improvement examples: cross/outer apply, sp_prepare, sp_execute
Configuration functions: @@LANGUAGE, @@SPID
SET options: SET ROWCOUNT, SET FMTONLY
34. Hadoop / Big Data Integration: Microsoft
T-SQL query engine for RDBMS & Hadoop
Cost-based optimizer decides on:
moving HDFS data into RDBMS storage, or
executing operators as Map/Reduce jobs
HDFS bridge for parallelized data transport to & from the HDFS data nodes
Regular T-SQL results
PDW V2
35. External Tables (Hadoop Integration)
External tables are mapped to HDFS files
Fields in the file are defined as columns in the PDW external table
File characteristics are also provided during definition
This works for HDInsight, Hortonworks HDP & Cloudera

CREATE EXTERNAL TABLE ClickEvent
(
  url varchar(50),
  event_date date,
  user_IP varchar(50)
)
WITH
  (LOCATION = 'hdfs://MyHadoop:5000/clickstream/click.txt',
   FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));
37. PolyBase: A Really Simple Query Example
Here the external data movement operation is ExternalRoundRobinMove
Parallel HDFS readers run on every data node (e.g. 10 nodes with 8 threads each)
39. Single Query: Structured and Unstructured
Query and join Hadoop tables with relational tables
Use standard SQL language
PolyBase: a breakthrough in data processing:
Existing SQL skillset
No IT intervention
Save time and costs
Analyze all data types
[Diagram: one SQL query spanning a database on SQL Server 2012 PDW and HDFS (Hadoop), powered by PolyBase]
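A sketch of such a single query, joining the ClickEvent external table from slide 35 with a relational PDW table — dbo.Customer and its columns are illustrative assumptions, not from the deck:

```sql
-- Join HDFS click data (external table) with a relational customer table
-- in one standard T-SQL statement; PolyBase handles the data movement.
SELECT c.CustomerName,
       COUNT(*) AS clicks
FROM dbo.Customer AS c
JOIN ClickEvent  AS e
  ON e.user_IP = c.ip_address
GROUP BY c.CustomerName;
```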
41. Resources
SQL Server CAT-Blog
http://sqlcat.com
bwin Case Study
http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000001470
Microsoft Big Data Site
http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx
Introduction to Hadoop on Windows Azure
http://channel9.msdn.com/Events/windowsazure/learn/Introduction-to-Hadoop-on-Windows-Azure
SQL Server Team Blog
http://blogs.technet.com/b/dataplatforminsider
Microsoft YouTube Big Data Channel
http://www.youtube.com/playlist?list=PLD471EE01A293CC34
TechEd Sessions
http://channel9.msdn.com/Events/TechEd
Microsoft Connect (Product Feedback)
http://connect.microsoft.com