Data Integration through Data Virtualization - PolyBase and new SQL Server 2019 Features (Presented at SQL Server Konferenz 2019 on February 21st, 2019)
1. Data Integration through
Data Virtualization
Cathrine Wilhelmsen, Inmeta
@cathrinew | cathrinew.net
February 21st 2019
2. Abstract
Data virtualization is an alternative to Extract, Transform and Load (ETL) processes. It handles the
complexity of integrating different data sources and formats without requiring you to replicate or
move the data itself. Save time, minimize effort, and eliminate duplicate data by creating a virtual
data layer using PolyBase in SQL Server.
In this session, we will first go through fundamental PolyBase concepts such as external data sources
and external tables. Then, we will look at the PolyBase improvements in SQL Server 2019. Finally, we
will create a virtual data layer that accesses and integrates both structured and unstructured data
from different sources. Along the way, we will cover lessons learned, best practices, and known
limitations.
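To make the two concepts named in the abstract concrete, here is a minimal sketch of an external data source and an external table. It assumes SQL Server 2019 with PolyBase enabled and another SQL Server instance as the remote source. Every name in it (`SqlServerCredential`, `RemoteSqlServer`, `remote-host`, `SalesDb.dbo.Customers`, `dbo.RemoteCustomers`, the login and the columns) is hypothetical and not taken from the session.

```sql
-- Sketch only: all object names, the host name, and the login are hypothetical.
-- A database master key is needed before a database scoped credential can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password here>';

-- Credential that PolyBase uses to log in to the remote SQL Server.
CREATE DATABASE SCOPED CREDENTIAL SqlServerCredential
WITH IDENTITY = 'polybase_reader', SECRET = '<remote login password>';

-- External data source: where the remote data lives and how to authenticate.
CREATE EXTERNAL DATA SOURCE RemoteSqlServer
WITH (
    LOCATION = 'sqlserver://remote-host:1433',
    CREDENTIAL = SqlServerCredential
);

-- External table: local metadata that describes the remote table.
-- No data is copied; rows are read from the source at query time.
CREATE EXTERNAL TABLE dbo.RemoteCustomers (
    CustomerId   INT           NOT NULL,
    CustomerName NVARCHAR(100) NULL
)
WITH (
    LOCATION = 'SalesDb.dbo.Customers',
    DATA_SOURCE = RemoteSqlServer
);

-- Query it like any other table.
SELECT TOP (10) CustomerId, CustomerName
FROM dbo.RemoteCustomers;
```

The column definitions in the external table have to be compatible with the remote table, so treat the two columns above as placeholders for the real schema.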
45. 1. Install Prerequisites
Microsoft .NET Framework 4.5
Oracle Java SE Runtime Environment (JRE) 7 or 8
2. Install PolyBase
Single Node or Scale-Out Group
3. Enable PolyBase
46. Install Prerequisites
Microsoft .NET Framework 4.5
https://www.microsoft.com/nl-nl/download/details.aspx?id=30653
Oracle Java SE Runtime Environment (JRE) 7 or 8
https://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html
47. Install PolyBase
Note: PolyBase can be installed on only one SQL
Server instance per machine.
Note: After you install PolyBase either standalone
or in a scale-out group, you have to uninstall and
reinstall to change it.
...Ask me how I know :)
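Once PolyBase is installed, the "Enable PolyBase" step might look like the sketch below. It assumes SQL Server 2019, where the `polybase enabled` server configuration option is available; earlier versions are set up differently, so check the documentation for the version in use.

```sql
-- Returns 1 if the PolyBase feature is installed on this instance, 0 if not.
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

-- Assumption: SQL Server 2019. Turn the feature on for the instance...
EXEC sp_configure @configname = 'polybase enabled', @configvalue = 1;

-- ...and apply the configuration change.
RECONFIGURE;
```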
80. Create Statistics
Note: To create statistics, SQL Server first imports the
external data into a temp table. Remember to
choose between sampling and a full scan.
Note: Updating statistics is not supported. Drop
and re-create instead.
81. Create Statistics
CREATE STATISTICS <StatName>
ON <TableName>(<ColumnName>);
CREATE STATISTICS <StatName>
ON <TableName>(<ColumnName>) WITH FULLSCAN;
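Because statistics on external tables cannot be updated, refreshing them means dropping and re-creating the statistics object. A minimal sketch, assuming an existing external table `dbo.ExternalSales` with a `CustomerKey` column and an existing statistics object named `Stat_ExternalSales_CustomerKey` (all three names are hypothetical):

```sql
-- Hypothetical names: dbo.ExternalSales, CustomerKey, Stat_ExternalSales_CustomerKey.

-- Updating is not supported on external tables, so drop the old object...
DROP STATISTICS dbo.ExternalSales.Stat_ExternalSales_CustomerKey;

-- ...and re-create it. With no WITH clause, SQL Server samples the data.
CREATE STATISTICS Stat_ExternalSales_CustomerKey
ON dbo.ExternalSales (CustomerKey);

-- Alternative: read every row instead of a sample. This is slower on large
-- external tables, because all of the data is imported first.
-- CREATE STATISTICS Stat_ExternalSales_CustomerKey
-- ON dbo.ExternalSales (CustomerKey) WITH FULLSCAN;
```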
87. Unexpected error encountered filling record
reader buffer: HadoopExecutionException:
Not enough columns in this line.
88. Unexpected error encountered filling record
reader buffer: HadoopExecutionException:
Too many columns in the line.
89. Unexpected error encountered filling record
reader buffer: HadoopExecutionException:
Could not find a delimiter after
string delimiter.
90. Unexpected error encountered filling record
reader buffer: HadoopExecutionException:
Error converting data type NVARCHAR to INT.
91. Unexpected error encountered filling record
reader buffer: HadoopExecutionException:
Conversion failed when converting the
NVARCHAR value '"0"' to data type BIT.
92. Unexpected error encountered filling record
reader buffer: HadoopExecutionException:
Too long string in column [-1]:
Actual len = [4242]. MaxLEN=[4000]
93. Msg 46518, Level 16, State 12, Line 1:
The type 'nvarchar(max)' is not supported
with external tables.
94. Msg 2717, Level 16, State 2, Line 1:
The size (10000) given to the parameter
exceeds the maximum allowed (4000).
95. Msg 131, Level 15, State 2, Line 1:
The size (10000) given to the column
exceeds the maximum allowed for any data
type (8000).
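Most of the errors above point back to a mismatch between the file contents and the external file format or external table definition. The sketch below shows where the relevant settings live. It assumes a delimited text file and an already created external data source; `CsvFormat`, `MyHadoopSource`, `dbo.ExternalOrders`, the `/orders/` location, and the column names are all hypothetical.

```sql
-- Sketch only: all object names and the location are hypothetical.

-- The file format controls how each line is split into columns.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',   -- must match the file, or column counts will be off
        STRING_DELIMITER = '"'    -- quotes around values, such as "0", are treated as delimiters
    )
);

-- The external table must list the same number of columns as the file,
-- with data types that the file values can be converted to.
CREATE EXTERNAL TABLE dbo.ExternalOrders (
    OrderId   INT            NOT NULL,
    IsShipped BIT            NULL,
    Comment   NVARCHAR(4000) NULL  -- NVARCHAR(MAX) is not supported; 4000 is the NVARCHAR limit
)
WITH (
    LOCATION = '/orders/',
    DATA_SOURCE = MyHadoopSource,  -- assumed to exist already
    FILE_FORMAT = CsvFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0               -- fail the query on the first row that cannot be loaded
);
```

Raising `REJECT_VALUE` lets a query tolerate a limited number of bad rows, which can help while tracking down which lines in a file cause the errors. Values longer than 4000 characters still cannot be read into an `NVARCHAR` column.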
98. SQL Server 2019 Big Data Clusters
SQL Server, Spark, and HDFS
Scalable clusters of containers
Runs on Kubernetes
99. [Diagram: SQL Server 2019 Big Data Cluster on Kubernetes. Four boxes labeled Kubernetes Pod; one SQL Server Master Instance; four groups of SQL Server, HDFS Data Node, and Spark.]
109. Biml 💚 PolyBase
Ben Weissman:
Using Biml to automagically keep your external
polybase tables in sync!
https://www.solisyon.de/biml-polybase-external-tables/
111. Azure Data Studio
1. Install Azure Data Studio
docs.microsoft.com/en-us/sql/azure-data-studio/download
2. Install Extension: SQL Server 2019 (Preview)
docs.microsoft.com/en-us/sql/azure-data-studio/sql-server-2019-extension