This document provides an overview of SQL Server 2019 Big Data Clusters, which enable a hybrid SQL Server/Spark scale-out data platform running on Kubernetes. Big Data Clusters are available in public preview and are expected to become generally available in the second half of 2019. They provide a true scale-out data platform for aggregating data from various sources, using data science tools against sensitive data on the same platform, and storing and querying large amounts of unstructured data with SQL Server tools.
3. About Me
EMEA SQL Server Solutions architect for Pure Storage
A user of SQL Server since 2000
14+ years of SQL Server experience
Speaker on the SQL Server community circuit
9. Disclaimer
The big data cluster content in this deck
is correct as of release candidate 1 (RC1)
Things may (and will) change
between RC1 and the RTM version of SQL Server 2019
10. Hybrid SQL Server / Spark scale-out
data platform
Features next generation of PolyBase
Runs on Kubernetes
What Are Big Data Clusters ?
11. Available in public preview form
Kerberos integration only available in
private preview form
GA sometime in the second half of 2019
possibly around Microsoft Ignite ?
The Story So Far
12. Containers at the bottom
Kubernetes in the middle
SQL Server 2019 big data clusters
at the top
The Different Layers Of The ‘Cake’
14. Scale-out
Container scheduling
Application resilience
Service discovery
Storage orchestration
Etc . . . Etc . . .
But All Is Not Well – What About ? . . .
18. [Diagram: NodePort services — every node in the cluster exposes the same port (31387 in the example); traffic arriving on that port at any node is routed to a backing pod]
Connecting To Applications
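A NodePort service like the one in the diagram can be created with kubectl; a minimal sketch, assuming a deployment named `myapp` listening on port 8080 (both names are hypothetical):

```shell
# Expose the deployment as a NodePort service; Kubernetes allocates a port
# in the 30000-32767 range (31387 in the diagram) on every node.
kubectl expose deployment myapp --type=NodePort --port=8080 --name=myapp-svc

# Print the allocated node port; the app is then reachable on
# <any-node-ip>:<node-port>.
kubectl get svc myapp-svc -o jsonpath='{.spec.ports[0].nodePort}'
```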
19. ‘Master’ node(s)
{API} server
Control
Scheduling
Cluster state (etcd / Cosmos DB)
The Kubernetes Cluster ‘Brain’
20. ‘Worker’ node(s)
Kubelets
Pods
Volumes
Container run time
The Kubernetes Cluster ‘Body’
21. Python
azdata (formerly mssqlctl)
kubectl
VS Code with the Kubernetes extension
Azure Data Studio
SQL Server 2019 extension
SQL Server Management Studio version 18+
Tools
23. [Diagram: Big Data Cluster architecture — analytics apps query SQL Server with T-SQL; PolyBase external tables reach out over ODBC to NoSQL stores, relational databases, and big data sources]
Big Data Cluster Architecture
azdata bdc hdfs mount create --remote-uri=s3a://<remote-uri> --mount-path=<mount-path>
HDFS Tiering
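Fleshing out the command above: a hedged sketch of mounting an S3 bucket as an HDFS tier, assuming the preview's MOUNT_CREDENTIALS mechanism for passing S3A keys (bucket, path and keys are placeholders):

```shell
# S3A credentials are supplied out-of-band via an environment variable
# (key=value pairs), not on the command line.
export MOUNT_CREDENTIALS="fs.s3a.access.key=<access-key>,fs.s3a.secret.key=<secret-key>"

# Mount the remote S3 path into the big data cluster's HDFS namespace.
azdata bdc hdfs mount create \
  --remote-uri s3a://<bucket>/<path> \
  --mount-path /mounts/s3data

# The mount is created asynchronously; poll its status.
azdata bdc hdfs mount status
```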
25. NAME                     TYPE          CLUSTER-IP     EXTERNAL-IP     PORT(S)         AGE
appproxy-svc-external    LoadBalancer  10.233.19.57   192.168.101.50  8080:30778/TCP  37h
controller-svc-external  LoadBalancer  10.233.63.91   192.168.101.51  80:30080/TCP    32h
gateway-svc-external     LoadBalancer  10.233.38.193  192.168.101.52  8443:30443/TCP  32h
master-svc-external      LoadBalancer  10.233.19.51   192.168.101.53  1433:31433/TCP  32h
mgmtproxy-svc-external   LoadBalancer  10.233.12.208  192.168.101.54  8080:30777/TCP  32h
azdata
Connecting To The Cluster
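Putting the service listing to use: a sketch of logging in and connecting, assuming the RC1-era azdata flags and the external IPs shown above (endpoint and credentials are placeholders):

```shell
# Log the CLI in against the controller endpoint
# (controller-svc-external above).
azdata login --endpoint https://<controller-ip>:<controller-port> --username admin

# The SQL Server master instance is just a TDS endpoint: connect with any
# client, e.g. sqlcmd against master-svc-external.
sqlcmd -S 192.168.101.53,1433 -U admin -P '<password>'
```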
26. Storage pool: curl, Azure Data Factory, kubectl cp or HDFS tiering
Data pool: any TDS tool / client or external tables
Data Ingestion
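For the storage pool, the curl route goes through the Knox gateway (gateway-svc-external on port 8443) using the WebHDFS REST API; a sketch with placeholder credentials and a hypothetical file name:

```shell
# PUT a local CSV into HDFS via the gateway's WebHDFS endpoint.
# -k skips certificate validation (self-signed cert in the preview),
# -L follows the two-step WebHDFS create redirect.
curl -k -L -u root:'<password>' -X PUT \
  "https://192.168.101.52:8443/gateway/default/webhdfs/v1/clickstream.csv?op=CREATE&overwrite=true" \
  -H "Content-Type: application/octet-stream" \
  -T clickstream.csv
```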
27. Create a Kubernetes cluster
Install Python
Install azdata
Set up environment variables or a
JSON config file
Create the cluster with azdata
Creating A Big Data Cluster: One Way
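The steps above can be sketched as a script; flag names reflect the RC1-era azdata and may change, and credentials are placeholders:

```shell
# Install azdata (per the preview docs; assumes Python 3 is present).
pip3 install -r https://aka.ms/azdata

# Credentials for the cluster being created.
export AZDATA_USERNAME=admin
export AZDATA_PASSWORD='<strong-password>'

# Either deploy straight from a built-in config profile...
azdata bdc create --config-profile kubeadm-dev-test --accept-eula yes

# ...or generate the JSON config first, edit it, then deploy.
azdata bdc config init --source kubeadm-dev-test --target my-bdc
```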
28. Install Python
Install azdata
az login (get account id)
Download and run
deploy-sql-big-data-aks.py
. . . The Fastest Way
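As a script, the quick path looks roughly like this (the deployment script lives in Microsoft's sql-server-samples repo; exact URL omitted here):

```shell
az login
az account show --query id -o tsv   # subscription id the script prompts for

# Download deploy-sql-big-data-aks.py from Microsoft's sql-server-samples
# repo, then run it; it creates the AKS cluster and the big data cluster.
python deploy-sql-big-data-aks.py
```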
29. Worker node hosts running out of
space for Docker images
docker rmi is your friend !!!
Using old versions of mssqlctl
mssqlctl --version to the rescue
Main Gotchas I Ran Into
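For the disk-space gotcha, a sketch of the cleanup (run on each worker node host; removes only untagged image layers):

```shell
# List dangling (untagged) images, then remove them to reclaim disk space.
docker images -q -f dangling=true
docker rmi $(docker images -q -f dangling=true)

# And for the version gotcha: check before you deploy.
mssqlctl --version
```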
30. Kubernetes
+ storage
= source of great confusion
Ephemeral storage
= great for kicking the tyres
What about production grade
installations ?
IBM RAMAC
the world's first commercial hard drive
A Word On Storage
31. Data protection
Security
Elasticity
Licensing
Missing Pieces Of The Puzzle
34. Other Cool Stuff
Deploy and consume apps:
R, Python, MLeap and SSIS,
Data wrangling with the PROSE code
accelerator,
Spark machine learning with MLeap,
sparklyr,
Bulk processing in Spark.
35. New York Taxi dataset (S3)
imported via HDFS tiering
Data analysed with a Python
notebook in Azure Data Studio
External table created for
analysis of the data using
Transact-SQL
Recorded Demo
https://www.youtube.com/watch?v=JXKvCLhKSw8