This document provides an overview of SQL Server 2019 Big Data Clusters, which enable a hybrid SQL Server/Spark scale-out data platform running on Kubernetes. Big Data Clusters are available in public preview and are expected to become generally available in the second half of 2019. They provide a true scale-out data platform for aggregating data from various sources, using data science tools against sensitive data on the same platform, and storing and querying large amounts of unstructured data with SQL Server tools.
3. About Me
EMEA SQL Server Solutions architect for Pure Storage
A user of SQL Server since 2000
14+ years of SQL Server experience
Speaker on the SQL Server community circuit
9. Disclaimer
The big data cluster content in this deck
is correct as of release candidate 1 (RC1)
Things may (and will) change
between RC1 and the RTM version of SQL Server 2019
10. Hybrid SQL Server / Spark scale-out
data platform
Features next generation of PolyBase
Runs on Kubernetes
What Are Big Data Clusters ?
11. Available in public preview form
Kerberos integration only available in
private preview form
GA sometime in the second half of 2019
possibly around Microsoft Ignite ?
The Story So Far
12. Containers at the bottom
Kubernetes in the middle
SQL Server 2019 big data clusters
at the top
The Different Layers Of The ‘Cake’
14. Scale-out
Container scheduling
Application resilience
Service discovery
Storage orchestration
Etc . . . Etc . . .
But All Is Not Well – What About ? . . .
18. [Diagram: NodePort services — every node in the cluster exposes the same port (31387 in the example); traffic arriving on that port at any node is routed to a backing pod]
Connecting To Applications
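A NodePort service like the one in the diagram can be created with kubectl; a minimal sketch, assuming a deployment named `myapp` listening on port 8080 (both names are hypothetical):

```shell
# Expose the deployment as a NodePort service; Kubernetes allocates a port
# in the 30000-32767 range (31387 in the diagram) on every node.
kubectl expose deployment myapp --type=NodePort --port=8080 --name=myapp-svc

# Print the allocated node port; the app is then reachable on
# <any-node-ip>:<node-port>.
kubectl get svc myapp-svc -o jsonpath='{.spec.ports[0].nodePort}'
```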
19. ‘Master’ node(s)
{API} server
Control
Scheduling
Cluster state (etcd / Cosmos DB)
The Kubernetes Cluster ‘Brain’
20. ‘Worker’ node(s)
Kubelets
Pods
Volumes
Container run time
The Kubernetes Cluster ‘Body’
21. Python
azdata (formerly mssqlctl)
kubectl
VS Code with the Kubernetes extension
Azure Data Studio
SQL Server 2019 extension
SQL Server Management Studio version 18+
Tools
23. [Diagram: Big Data Cluster architecture — analytics apps query SQL Server with T-SQL; PolyBase external tables reach out over ODBC to NoSQL stores, relational databases, and big data sources]
Big Data Cluster Architecture
azdata bdc hdfs mount create --remote-uri=s3a://<remote-uri> --mount-path=<mount-path>
HDFS Tiering
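Fleshing out the command above: a hedged sketch of mounting an S3 bucket as an HDFS tier, assuming the preview's MOUNT_CREDENTIALS mechanism for passing S3A keys (bucket, path and keys are placeholders):

```shell
# S3A credentials are supplied out-of-band via an environment variable
# (key=value pairs), not on the command line.
export MOUNT_CREDENTIALS="fs.s3a.access.key=<access-key>,fs.s3a.secret.key=<secret-key>"

# Mount the remote S3 path into the big data cluster's HDFS namespace.
azdata bdc hdfs mount create \
  --remote-uri s3a://<bucket>/<path> \
  --mount-path /mounts/s3data

# The mount is created asynchronously; poll its status.
azdata bdc hdfs mount status
```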
25. NAME                     TYPE          CLUSTER-IP     EXTERNAL-IP     PORT(S)         AGE
appproxy-svc-external    LoadBalancer  10.233.19.57   192.168.101.50  8080:30778/TCP  37h
controller-svc-external  LoadBalancer  10.233.63.91   192.168.101.51  80:30080/TCP    32h
gateway-svc-external     LoadBalancer  10.233.38.193  192.168.101.52  8443:30443/TCP  32h
master-svc-external      LoadBalancer  10.233.19.51   192.168.101.53  1433:31433/TCP  32h
mgmtproxy-svc-external   LoadBalancer  10.233.12.208  192.168.101.54  8080:30777/TCP  32h
azdata
Connecting To The Cluster
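Putting the service listing to use: a sketch of logging in and connecting, assuming the RC1-era azdata flags and the external IPs shown above (endpoint and credentials are placeholders):

```shell
# Log the CLI in against the controller endpoint
# (controller-svc-external above).
azdata login --endpoint https://<controller-ip>:<controller-port> --username admin

# The SQL Server master instance is just a TDS endpoint: connect with any
# client, e.g. sqlcmd against master-svc-external.
sqlcmd -S 192.168.101.53,1433 -U admin -P '<password>'
```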
26. Storage pool: curl, Azure Data Factory, kubectl cp or HDFS tiering
Data pool: any TDS tool / client or external tables
Data Ingestion
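For the storage pool, the curl route goes through the Knox gateway (gateway-svc-external on port 8443) using the WebHDFS REST API; a sketch with placeholder credentials and a hypothetical file name:

```shell
# PUT a local CSV into HDFS via the gateway's WebHDFS endpoint.
# -k skips certificate validation (self-signed cert in the preview),
# -L follows the two-step WebHDFS create redirect.
curl -k -L -u root:'<password>' -X PUT \
  "https://192.168.101.52:8443/gateway/default/webhdfs/v1/clickstream.csv?op=CREATE&overwrite=true" \
  -H "Content-Type: application/octet-stream" \
  -T clickstream.csv
```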
27. Create a Kubernetes cluster
Install Python
Install azdata
Set up environment variables or a
JSON config file
Create the cluster with azdata
Creating A Big Data Cluster: One Way
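The steps above can be sketched as a script; flag names reflect the RC1-era azdata and may change, and credentials are placeholders:

```shell
# Install azdata (per the preview docs; assumes Python 3 is present).
pip3 install -r https://aka.ms/azdata

# Credentials for the cluster being created.
export AZDATA_USERNAME=admin
export AZDATA_PASSWORD='<strong-password>'

# Either deploy straight from a built-in config profile...
azdata bdc create --config-profile kubeadm-dev-test --accept-eula yes

# ...or generate the JSON config first, edit it, then deploy.
azdata bdc config init --source kubeadm-dev-test --target my-bdc
```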
28. Install Python
Install azdata
az login (get account id)
Download and run
deploy-sql-big-data-aks.py
. . . The Fastest Way
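As a script, the quick path looks roughly like this (the deployment script lives in Microsoft's sql-server-samples repo; exact URL omitted here):

```shell
az login
az account show --query id -o tsv   # subscription id the script prompts for

# Download deploy-sql-big-data-aks.py from Microsoft's sql-server-samples
# repo, then run it; it creates the AKS cluster and the big data cluster.
python deploy-sql-big-data-aks.py
```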
29. Worker node hosts running out of
space for Docker images
docker rmi is your friend !!!
Using old versions of mssqlctl
mssqlctl --version to the rescue
Main Gotchas I Ran Into
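For the disk-space gotcha, a sketch of the cleanup (run on each worker node host; removes only untagged image layers):

```shell
# List dangling (untagged) images, then remove them to reclaim disk space.
docker images -q -f dangling=true
docker rmi $(docker images -q -f dangling=true)

# And for the version gotcha: check before you deploy.
mssqlctl --version
```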
30. Kubernetes
+ storage
= source of great confusion
Ephemeral storage
= great for kicking the tyres
What about production grade
installations ?
IBM RAMAC
the world's first commercial hard drive
A Word On Storage
31. Data protection
Security
Elasticity
Licensing
Missing Pieces Of The Puzzle
34. Other Cool Stuff
Deploy and consume apps:
R, Python, MLeap and SSIS,
Data wrangling with the PROSE code
accelerator,
Spark machine learning with MLeap,
sparklyr,
Bulk processing in Spark.
35. New York Taxi dataset (S3)
imported via HDFS tiering
Data analysed with a Python
notebook in Azure Data Studio
External table created for
analysis of the data using
Transact-SQL
Recorded Demo
https://www.youtube.com/watch?v=JXKvCLhKSw8