SlideShare une entreprise Scribd logo
1  sur  23
Télécharger pour lire hors ligne
Azure Databricks
Customer Experiences
and Lessons
Denzil Ribeiro & Madhu Ganta
Microsoft
Objectives
• Understand customer deployment of Azure Databricks
• Understand customer integration requirements on Azure
platform
• Best practices on Azure Databricks
Azure Databricks Deployment
User resource group
Databricks locked resource group
DBFS
Databricks
control plane
Rest Endpoint
https://region.azuredat
abricks.net
Hive metastore
Vnet Peering
Worker Worker Worker Worker
Driver
Cluster
What VM should I pick for my workload?
VM SKU Impact
• Different VM SKU can be chosen for driver v/s workers
• Cores (Impacts parallelism)
• Memory (Impacts JVM heap)
• Local SSD Size & IOPS Limits (Impacts DBIO cache & spills)
• Attached Disk Throughput Limits (Impacts DBIO cache & spills)
• Network Throughput Limits(Impacts shuffle)
SKU vCPU Memory
(GiB)
Local
SSD
Size
(GiB)
Max Cached
Local SSD
Throughput
(IOPS / MBps)
Max uncached
attached disk
Throughput
(IOPS / MBps)
Max NICs /
network
bandwidth
(Mbps)
Standard_DS3_v2 4 14 28 16,000 / 128 12,800 / 192 4 / 3000
Standard_DS14 16 112 224 64,000 / 512 51,200 / 512 8 / 8000
Standard_L16s 16 128 2,807 80,000 / 800 20,000 / 500 8 / 16,000
Where can I store my data?
• Built-in Distributed File system tied to workspace
• Layer on top of Azure storage - default setup gives no “control” outside of
databricks workspace.
• Other file systems can be mounted on to DBFS
DBFS
• Managed azure service providing highly redundant scalable, secure storage
• Data can be accessed via storage Keys or SAS tokens
• V2 storage accounts have higher limits
Azure Blob Storage
• Databricks can access data using Azure Active Directory Authentication and
Authorization
• Allows ACLs on data secured via Azure Service Principal Names (SPNs)
• Region availability less than Azure Blob storage
Azure Data Lake Store
How do I mount external storage?
Python Scala CLI dbutils
DBFS
Azure Blob storage Azure Data Lake Store
Integration experiences on Azure
How do I optimally load or query data into SQL Server?
Load data to Azure SQL Database or SQL Server Instance
Use SQL Spark Connector rather than JDBC
Sample bulk load performance
• Specify TABLOCK for heaps
• Specify Batch size of 102,400 or
greater for columnstore tables
• No TABLOCK for columnstore tables
• More details : Turbo boost data loads
from Spark using SQL Spark connector
JDBC V/s SQL Connector comparison in data loading
How do I read large datasets from SQL Data warehouse?
Azure Storage Blob
Driver
Worker Worker Worker
Control
Node
Compute
Node
Compute
Node
Compute
Node
DF.Load() JDBC Connection
Parquet
Parquet
Parquet
tmpDir
JDBC connection string
DW Table
Azure Blob Storage directory
Spark cluster SQL DW
Read
Write
How do I load large datasets into SQL Data warehouse?
Azure Storage Blob
Driver
Worker Worker Worker
Control
Node
Compute
Node
Compute
Node
Compute
Node
DF.save() JDBC Connection
Parquet
Parquet
Parquet
tmpDir
Pre and Post
Execution Steps
Create DB Scoped Credential
Create External Data Source
Create External File Format
CTAS with column projection
Spark cluster SQL DW
Write Read
Create a
Database Master
Key
How do I optimally read data from CosmosDB?
▪ With Spark to Azure ComosDB connector,
Spark can now interact with all Cosmos DB
data models: Documents, Tables, and Graphs
• Exploits the native Azure Cosmos DB managed
indexes
• Utilizes push-down predicate filtering on fast-
changing globally-distributed data
How do I optimally bulk load data into CosmosDB?
• Use Bulk API
• Default Batch size of 500 is low.
Consider increasing to 100,000
• Configure indexing to lazy (*note
reads maybe inconsistent for a while)
• Monitor and Increase RU’s as needed
• Try to map Spark partition setup to
CosmosDB partitioning
• Use a larger page size to reduce
network roundtrips
How do I build a streaming solution on Azure?
Connect to Azure Event Hub from a Spark Cluster using the Event Hub Spark Connector
1. Setup connection to Event Hub 2. Read data from Event Hub
Event hubs for Kafka is in preview with Kafka API (1.0.0) calls
Event Hub auto-inflate feature can scale up to handle load increases
Ingest, Wrangle Data
& Boost productivity
with Spark
Apps + insights
Social
LOB
Graph
IoT
Image
CRM
Azure Databricks
IoT
Azure Machine
Learning SDK on
Databricks
Azure Batch
AKS
Agile development at
scale with Python,
Spark, Deep
Learning
Easily deploy,
Version, Manage &
Monitor AI Models in
parallel at scale
Prepare Data Build & Train Deploy
How do I operationalize my ML model?
How do I operationalize my ML model?
1. Provision Azure Kubernetes Service 2. Deploy the model web service
Azure ML API to serialize and deploy the model from your Databricks notebook
Azure container services
How do I orchestrate my data pipeline?
On Premises Data Cloud Svcs & Data
Azure Blob storage Databricks Job
Send
Email
PowerBI Dashboards
Azure Data Factory Pipeline
• ADF Enables orchestration of Data
Pipeline
• Copy of Data
• Transformation of data
• Control flow enables conditional
logic
• Securely store encrypted
credentials
• Scheduling of Pipeline
Best Practices
Security best practices
• Provides most isolation
• Ensure User Access Control and Cluster Access control is enabled
Databricks Workspace
• Use different SAS tokens with different permission sets for Blob Storage
• Can mount different ADLS folders with different Access permissions
• Use ADLS SPNs for permissions on folders
• Encryption is default for all storage accounts
• Table ACLs exist only on clusters allowing python and SQL commands
Storage security
• Use Databricks secrets to store credentials needed in Spark conf file or connection strings
• Secure personal access tokens
Secrets Management
Performance and limits
• VM Size limits (Cores, Networking, IO)
• VM SSD size and shuffle/DBIO Caching
• GPU optimized VM’s for AI and Deep learning models
• Storage IOPs limits and monitoring for Azure storage
• Region Awareness (co-location with other Azure Services)
Performance Considerations
• Azure Event Hub auto-inflate settings
• Databricks cluster auto-scale
• CosmosDB request units for increased throughput
• Managing Data warehouse compute
• Azure SQL database service tier
Service scalability
• Default subscription quotas on large clusters (Cores/IP)
Quotas and Limits
Operational Best practices
• Custom log delivery to a DBFS location
• Use a mount point to store logs on Azure blob storage
Cluster Logs location
• Workspace API export, instantiate new workspace
• Storage replication
DR site
• Monitor DBIO cache size and pick the appropriate worker node types
• Can enable OMS monitoring with init script associating cluster VM’s to
an OMS workspace
Monitoring
Sample OMS monitoring chart
Q&A
Thank You!!!

Contenu connexe

Tendances

Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
 
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...Denodo
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleAdam Doyle
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentBlueData, Inc.
 
Lessons Learned: Implementing Azure Synapse Analytics in a Rapidly-Changing S...
Lessons Learned: Implementing Azure Synapse Analytics in a Rapidly-Changing S...Lessons Learned: Implementing Azure Synapse Analytics in a Rapidly-Changing S...
Lessons Learned: Implementing Azure Synapse Analytics in a Rapidly-Changing S...Cathrine Wilhelmsen
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPDatabricks
 
Get Savvy with Snowflake
Get Savvy with SnowflakeGet Savvy with Snowflake
Get Savvy with SnowflakeMatillion
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks FundamentalsDalibor Wijas
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for DummiesRodney Joyce
 
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data PipelinesBest Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data PipelinesEric Kavanagh
 
Modern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemModern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemJames Serra
 
Hadoop Migration to databricks cloud project plan.pptx
Hadoop Migration to databricks cloud project plan.pptxHadoop Migration to databricks cloud project plan.pptx
Hadoop Migration to databricks cloud project plan.pptxyashodhannn
 

Tendances (20)

Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
 
Azure Synapse Analytics
Azure Synapse AnalyticsAzure Synapse Analytics
Azure Synapse Analytics
 
Lessons Learned: Implementing Azure Synapse Analytics in a Rapidly-Changing S...
Lessons Learned: Implementing Azure Synapse Analytics in a Rapidly-Changing S...Lessons Learned: Implementing Azure Synapse Analytics in a Rapidly-Changing S...
Lessons Learned: Implementing Azure Synapse Analytics in a Rapidly-Changing S...
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
 
Get Savvy with Snowflake
Get Savvy with SnowflakeGet Savvy with Snowflake
Get Savvy with Snowflake
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Data and AI reference architecture
Data and AI reference architectureData and AI reference architecture
Data and AI reference architecture
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
Data Lake,beyond the Data Warehouse
Data Lake,beyond the Data WarehouseData Lake,beyond the Data Warehouse
Data Lake,beyond the Data Warehouse
 
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data PipelinesBest Practices in DataOps: How to Create Agile, Automated Data Pipelines
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines
 
Modern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemModern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform System
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Data Migration to Azure
Data Migration to AzureData Migration to Azure
Data Migration to Azure
 
Hadoop Migration to databricks cloud project plan.pptx
Hadoop Migration to databricks cloud project plan.pptxHadoop Migration to databricks cloud project plan.pptx
Hadoop Migration to databricks cloud project plan.pptx
 
Introduction to Dremio
Introduction to DremioIntroduction to Dremio
Introduction to Dremio
 

Similaire à Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta

Azure - Data Platform
Azure - Data PlatformAzure - Data Platform
Azure - Data Platformgiventocode
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseJames Serra
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudMark Kromer
 
AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...
AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...
AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...Amazon Web Services
 
44spotkaniePLSSUGWRO_CoNowegowKrainieChmur
44spotkaniePLSSUGWRO_CoNowegowKrainieChmur44spotkaniePLSSUGWRO_CoNowegowKrainieChmur
44spotkaniePLSSUGWRO_CoNowegowKrainieChmurTobias Koprowski
 
Afternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data ServicesAfternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data ServicesCCG
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)James Serra
 
Implement SQL Server on an Azure VM
Implement SQL Server on an Azure VMImplement SQL Server on an Azure VM
Implement SQL Server on an Azure VMJames Serra
 
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...Rustem Feyzkhanov
 
AWS Webcast - Build Agile Applications in AWS Cloud
AWS Webcast - Build Agile Applications in AWS CloudAWS Webcast - Build Agile Applications in AWS Cloud
AWS Webcast - Build Agile Applications in AWS CloudAmazon Web Services
 
Microsoft Azure News - 2018 December
Microsoft Azure News - 2018 DecemberMicrosoft Azure News - 2018 December
Microsoft Azure News - 2018 DecemberDaniel Toomey
 
More Cache for Less Cash
More Cache for Less CashMore Cache for Less Cash
More Cache for Less CashMichael Collier
 
Migrating Oracle Databases to AWS
Migrating Oracle Databases to AWSMigrating Oracle Databases to AWS
Migrating Oracle Databases to AWSAWS Germany
 
Sergiy Lunyakin "Cloud BI with Azure Analysis Services"
Sergiy Lunyakin "Cloud BI with Azure Analysis Services"Sergiy Lunyakin "Cloud BI with Azure Analysis Services"
Sergiy Lunyakin "Cloud BI with Azure Analysis Services"DataConf
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data PlatformShu-Jeng Hsieh
 

Similaire à Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta (20)

Azure - Data Platform
Azure - Data PlatformAzure - Data Platform
Azure - Data Platform
 
04 Azure IAAS 101
04 Azure IAAS 10104 Azure IAAS 101
04 Azure IAAS 101
 
A to z for sql azure databases
A to z for sql azure databasesA to z for sql azure databases
A to z for sql azure databases
 
IaaS azure_vs_amazon
IaaS azure_vs_amazonIaaS azure_vs_amazon
IaaS azure_vs_amazon
 
AWS Webcast - Website Hosting
AWS Webcast - Website HostingAWS Webcast - Website Hosting
AWS Webcast - Website Hosting
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
 
AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...
AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...
AWS Webcast - AWS Webinar Series for Education #3 - Discover the Ease of AWS ...
 
44spotkaniePLSSUGWRO_CoNowegowKrainieChmur
44spotkaniePLSSUGWRO_CoNowegowKrainieChmur44spotkaniePLSSUGWRO_CoNowegowKrainieChmur
44spotkaniePLSSUGWRO_CoNowegowKrainieChmur
 
Afternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data ServicesAfternoons with Azure - Azure Data Services
Afternoons with Azure - Azure Data Services
 
AZURE Data Related Services
AZURE Data Related ServicesAZURE Data Related Services
AZURE Data Related Services
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Implement SQL Server on an Azure VM
Implement SQL Server on an Azure VMImplement SQL Server on an Azure VM
Implement SQL Server on an Azure VM
 
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
DataTalks.Club - Building Scalable End-to-End Deep Learning Pipelines in the ...
 
AWS Webcast - Build Agile Applications in AWS Cloud
AWS Webcast - Build Agile Applications in AWS CloudAWS Webcast - Build Agile Applications in AWS Cloud
AWS Webcast - Build Agile Applications in AWS Cloud
 
Microsoft Azure News - 2018 December
Microsoft Azure News - 2018 DecemberMicrosoft Azure News - 2018 December
Microsoft Azure News - 2018 December
 
More Cache for Less Cash
More Cache for Less CashMore Cache for Less Cash
More Cache for Less Cash
 
Migrating Oracle Databases to AWS
Migrating Oracle Databases to AWSMigrating Oracle Databases to AWS
Migrating Oracle Databases to AWS
 
Sergiy Lunyakin "Cloud BI with Azure Analysis Services"
Sergiy Lunyakin "Cloud BI with Azure Analysis Services"Sergiy Lunyakin "Cloud BI with Azure Analysis Services"
Sergiy Lunyakin "Cloud BI with Azure Analysis Services"
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 

Dernier

BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 

Dernier (20)

BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 

Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta

  • 1. Azure Databricks Customer Experiences and Lessons Denzil Ribeiro & Madhu Ganta Microsoft
  • 2. Objectives • Understand customer deployment of Azure Databricks • Understand customer integration requirements on Azure platform • Best practices on Azure Databricks
  • 3. Azure Databricks Deployment User resource group Databricks locked resource group DBFS Databricks control plane Rest Endpoint https://region.azuredat abricks.net Hive metastore Vnet Peering Worker Worker Worker Worker Driver Cluster
  • 4. What VM should I pick for my workload? VM SKU Impact • Different VM SKU can be chosen for driver v/s workers • Cores (Impacts parallelism) • Memory (Impacts JVM heap) • Local SSD Size & IOPS Limits (Impacts DBIO cache & spills) • Attached Disk Throughput Limits (Impacts DBIO cache & spills) • Network Throughput Limits(Impacts shuffle) SKU vCPU Memory (GiB) Local SSD Size (GiB) Max Cached Local SSD Throughput (IOPS / MBps) Max uncached attached disk Throughput (IOPS / MBps) Max NICs / network bandwidth (Mbps) Standard_DS3_v2 4 14 28 16,000 / 128 12,800 / 192 4 / 3000 Standard_DS14 16 112 224 64,000 / 512 51,200 / 512 8 / 8000 Standard_L16s 16 128 2,807 80,000 / 800 20,000 / 500 8 / 16,000
  • 5. Where can I store my data? • Built-in Distributed File system tied to workspace • Layer on top of Azure storage - default setup gives no “control” outside of databricks workspace. • Other file systems can be mounted on to DBFS DBFS • Managed azure service providing highly redundant scalable, secure storage • Data can be accessed via storage Keys or SAS tokens • V2 storage accounts have higher limits Azure Blob Storage • Databricks can access data using Azure Active Directory Authentication and Authorization • Allows ACLs on data secured via Azure Service Principal Names (SPNs) • Region availability less than Azure Blob storage Azure Data Lake Store
  • 6. How do I mount external storage? Python Scala CLI dbutils DBFS Azure Blob storage Azure Data Lake Store
  • 8. How do I optimally load or query data into SQL Server? Load data to Azure SQL Database or SQL Server Instance Use SQL Spark Connector rather than JDBC
  • 9. Sample bulk load performance • Specify TABLOCK for heaps • Specify Batch size of 102,400 or greater for columnstore tables • No TABLOCK for columnstore tables • More details : Turbo boost data loads from Spark using SQL Spark connector JDBC V/s SQL Connector comparison in data loading
  • 10. How do I read large datasets from SQL Data warehouse? Azure Storage Blob Driver Worker Worker Worker Control Node Compute Node Compute Node Compute Node DF.Load() JDBC Connection Parquet Parquet Parquet tmpDir JDBC connection string DW Table Azure Blob Storage directory Spark cluster SQL DW Read Write
  • 11. How do I load large datasets into SQL Data warehouse? Azure Storage Blob Driver Worker Worker Worker Control Node Compute Node Compute Node Compute Node DF.save() JDBC Connection Parquet Parquet Parquet tmpDir Pre and Post Execution Steps Create DB Scoped Credential Create External Data Source Create External File Format CTAS with column projection Spark cluster SQL DW Write Read Create a Database Master Key
  • 12. How do I optimally read data from CosmosDB? ▪ With Spark to Azure ComosDB connector, Spark can now interact with all Cosmos DB data models: Documents, Tables, and Graphs • Exploits the native Azure Cosmos DB managed indexes • Utilizes push-down predicate filtering on fast- changing globally-distributed data
  • 13. How do I optimally bulk load data into CosmosDB? • Use Bulk API • Default Batch size of 500 is low. Consider increasing to 100,000 • Configure indexing to lazy (*note reads maybe inconsistent for a while) • Monitor and Increase RU’s as needed • Try to map Spark partition setup to CosmosDB partitioning • Use a larger page size to reduce network roundtrips
  • 14. How do I build a streaming solution on Azure? Connect to Azure Event Hub from a Spark Cluster using the Event Hub Spark Connector 1. Setup connection to Event Hub 2. Read data from Event Hub Event hubs for Kafka is in preview with Kafka API (1.0.0) calls Event Hub auto-inflate feature can scale up to handle load increases
  • 15. Ingest, Wrangle Data & Boost productivity with Spark Apps + insights Social LOB Graph IoT Image CRM Azure Databricks IoT Azure Machine Learning SDK on Databricks Azure Batch AKS Agile development at scale with Python, Spark, Deep Learning Easily deploy, Version, Manage & Monitor AI Models in parallel at scale Prepare Data Build & Train Deploy How do I operationalize my ML model?
  • 16. How do I operationalize my ML model? 1. Provision Azure Kubernetes Service 2. Deploy the model web service Azure ML API to serialize and deploy the model from your Databricks notebook Azure container services
  • 17. How do I orchestrate my data pipeline? On Premises Data Cloud Svcs & Data Azure Blob storage Databricks Job Send Email PowerBI Dashboards Azure Data Factory Pipeline • ADF Enables orchestration of Data Pipeline • Copy of Data • Transformation of data • Control flow enables conditional logic • Securely store encrypted credentials • Scheduling of Pipeline
  • 19. Security best practices • Provides most isolation • Ensure User Access Control and Cluster Access control is enabled Databricks Workspace • Use different SAS tokens with different permission sets for Blob Storage • Can mount different ADLS folders with different Access permissions • Use ADLS SPNs for permissions on folders • Encryption is default for all storage accounts • Table ACLs exist only on clusters allowing python and SQL commands Storage security • Use Databricks secrets to store credentials needed in Spark conf file or connection strings • Secure personal access tokens Secrets Management
  • 20. Performance and limits • VM Size limits (Cores, Networking, IO) • VM SSD size and shuffle/DBIO Caching • GPU optimized VM’s for AI and Deep learning models • Storage IOPs limits and monitoring for Azure storage • Region Awareness (co-location with other Azure Services) Performance Considerations • Azure Event Hub auto-inflate settings • Databricks cluster auto-scale • CosmosDB request units for increased throughput • Managing Data warehouse compute • Azure SQL database service tier Service scalability • Default subscription quotas on large clusters (Cores/IP) Quotas and Limits
  • 21. Operational Best practices • Custom log delivery to a DBFS location • Use a mount point to store logs on Azure blob storage Cluster Logs location • Workspace API export, instantiate new workspace • Storage replication DR site • Monitor DBIO cache size and pick the appropriate worker node types • Can enable OMS monitoring with init script associating cluster VM’s to an OMS workspace Monitoring