The cloud has become one of the most attractive ways for enterprises to purchase software, but it requires building products in a very different way from traditional software.
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
1. Lessons from building large-scale, multi-cloud, SaaS software at Databricks
Jeff Pang
Principal Software Engineer @ Databricks
2. Who am I?
▪ Jeff Pang
Principal Software Engineer, Databricks
▪ Databricks Platform Engineering
To help data teams solve the world’s toughest problems,
the Databricks Platform team provides the world-class,
multi-cloud platform that enables us to expand fast and
iterate quickly
http://databricks.com/careers
3. About
▪ Founded in 2013 by the original creators of Apache Spark
▪ Data and AI platform as a service for 5000+ customers
▪ 1000+ employees, 200+ engineers, >$200M annual recurring revenue
5. Agenda
The architecture
Inside the Unified Analytics Platform
Challenges & lessons
Growing a SaaS data platform
Operating on multiple clouds
Accelerating a data platform with data & AI
7. Simple data engineering architecture
[Diagram: CSV, JSON, TXT… → Data Lake (S3, HDFS, Blob Store, etc.) → Bronze (raw ingestion) → Silver (filtered, cleaned, augmented) → Gold (business-level aggregates) → reporting and analytics; compute: cluster]
8. Modern data engineering architecture
[Diagram: CSV, JSON, TXT… and Kinesis → Data Lake → Bronze → Silver → Gold → reporting, notebooks, AI, and streaming analytics; supported by workflow scheduling, clusters, and cluster management]
9. Multiply by thousands of customers...
[Diagram: many customer networks, each with its own data lake and sources (CSV, JSON, TXT…, Kinesis), all connected to one control plane providing collaborative notebooks, AI, streaming analytics, workflow scheduling, cluster management, admin & security, reporting, and business insights]
13. That’s the Databricks control plane
What did we learn from building a large-scale, multi-cloud data platform?
▪ 100,000s of users
▪ 100,000s of Spark clusters per day
▪ Millions of VMs launched per day
▪ Exabytes of data processed per day
15. Evolution of the Databricks control plane
We didn’t start with a global-scale, multi-cloud data platform
Challenge: Scaling a data platform from one customer to 5000+
Lesson: The factory that builds and evolves the data platform is more
important than the data platform itself
16. Fast time to market
Databricks control plane “in-a-box”
▪ Need to deliver value quickly
▪ Need to iterate quickly
▪ Can’t break things while iterating!
Keys to success:
▪ Modern CI
▪ Fast developer tools
▪ Testing, testing, testing
V1 V2
▪ 25-500x Scala build speedups
▪ 10s of millions of tests per day
▪ 100s of Databricks “in-a-box” test envs per day
17. Expand the total addressable market
Replicating control planes quickly
▪ Need different configurations for
different environments
▪ Need to update many environments
▪ Can’t slow down platform
development!
Keys to success:
▪ Declarative infrastructure
▪ Modern CD infrastructure
jsonnet
10 million lines
250k lines
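The idea of declarative infrastructure with different configurations per environment can be illustrated with a minimal sketch. The slide names jsonnet as the tool; the Python below only imitates the general pattern of a shared base configuration plus small per-environment overrides, and every field name and environment name in it is a hypothetical placeholder rather than anything from the Databricks setup.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where override wins and nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical base definition shared by every environment.
BASE = {
    "service": "api-server",
    "replicas": 2,
    "resources": {"cpu": "1", "memory": "2Gi"},
}

# Hypothetical per-environment overrides: only the differences are stated.
ENVIRONMENTS = {
    "dev": {"replicas": 1},
    "aws-prod": {"cloud": "aws", "replicas": 8, "resources": {"memory": "8Gi"}},
    "azure-prod": {"cloud": "azure", "replicas": 6},
}


def render_all(base, environments):
    """Expand the base plus overrides into one full config per environment."""
    return {name: deep_merge(base, overrides) for name, overrides in environments.items()}


RENDERED = render_all(BASE, ENVIRONMENTS)
```

Under this pattern, adding an environment means adding a few override lines, and a change to the base reaches every environment on the next render.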
18. Service Framework
Land and expand workloads
Scaling the control plane
▪ Need to support more users &
workloads
▪ Need to build more features that scale
▪ Don’t want devs to reinvent the wheel!
Keys to success:
▪ A service framework to do the hard
stuff
▪ Decompose monoliths to microservices
Container & replica management, APIs & RPCs, rate limits, metrics, logging, secrets & security, ...
[Diagram, version 1: Cloud VM API, Cluster Manager, Customer Clusters]
[Diagram, version 3: Cloud VM API, CM Master, CM Shards, Workers, API Servers, Customer Clusters; scaled with usage]
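Rate limits are one of the "hard stuff" items the service framework is said to handle so that individual services do not reimplement them. As a rough illustration of what such a shared building block might look like, here is a minimal token-bucket sketch. The talk does not say which algorithm the framework uses, so the token bucket, the class name, and the parameters are all assumptions made for the example.

```python
class TokenBucket:
    """Token-bucket rate limiter (illustrative; not the Databricks implementation).

    The caller passes the current time explicitly so behavior is deterministic.
    """

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = float(rate_per_sec)   # tokens added per second
        self.capacity = float(capacity)   # maximum burst size
        self.tokens = float(capacity)     # start with a full bucket
        self.last = float(now)            # time of the last refill

    def allow(self, now, cost=1.0):
        """Refill based on elapsed time, then admit the request if tokens suffice."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A service could keep one bucket per caller or per API and reject or queue requests when `allow` returns `False`.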
19. Data Platform
The Databricks data platform factory
[Diagram: many customer networks served by control plane services (API Servers, CM Master, CM Shards, Workers), layered on Envoy and GraphQL; HCVault, Consul, Prometheus, ELK, Jaeger, Grafana, common IAM, onboarding, billing, ...; Kubernetes; and cloud VMs, network, storage, databases]
21. Why multi-cloud?
The data platform needs to be where the data is
▪ Performance, latency, egress data costs
▪ Cloud-specific integrations
▪ Data governance policies
Challenge: Supporting multiple clouds without sacrificing dev velocity
Lesson: A cloud-agnostic layer is key to dev velocity, but it also needs to
integrate with the standards of each cloud and deal with their quirks
22. Challenge: dev velocity on multiple clouds
Services?
Many cloud services have no direct equivalents
▪ DynamoDB vs ?
▪ CosmosDB vs ?
▪ Aurora vs ?
▪ SQL DW vs ?
APIs?
Cloud APIs don’t look like each other
▪ SDK: no common interfaces
▪ Auth: IAM vs AAD
▪ ACLs: IAM vs Azure RBAC
Ops?
Operational tools for each cloud are very different
▪ Templates: CloudFormation vs ARM templates
▪ Logs: CloudWatch vs Azure Monitor
23. Approach: cloud agnostic dev framework
Use lowest common denominator cloud services
▪ EKS ← Kubernetes → AKS
▪ EC2 ≈ Azure Compute
▪ VPC ≈ VNet
▪ RDS MySQL/Postgres ≈ Azure Database for MySQL/Postgres
▪ ELB ≈ Azure Load Balancer
Service framework API: Envoy; HCVault, Consul, Prometheus, ELK, Jaeger, Grafana, common IAM, onboarding, billing, ...
Services on top: API Servers, CM Master, CM Shards, Workers
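The approach on this slide, where services code against one framework API while roughly equivalent cloud services sit underneath, can be sketched as an interface with one implementation per cloud. The sketch below is only an illustration: `VmProvider`, `InMemoryProvider`, and `resize_cluster` are hypothetical names, and the providers are in-memory stand-ins that call no real cloud SDK.

```python
from abc import ABC, abstractmethod


class VmProvider(ABC):
    """Hypothetical cloud-agnostic interface that service code depends on."""

    @abstractmethod
    def launch(self, count):
        """Start `count` VMs and return their ids."""

    @abstractmethod
    def terminate(self, vm_ids):
        """Stop the given VMs."""


class InMemoryProvider(VmProvider):
    """Stand-in for a per-cloud implementation; a real one would wrap that cloud's SDK."""

    def __init__(self, id_prefix):
        self._id_prefix = id_prefix
        self._counter = 0
        self.running = set()

    def launch(self, count):
        ids = []
        for _ in range(count):
            self._counter += 1
            vm_id = f"{self._id_prefix}-{self._counter}"
            self.running.add(vm_id)
            ids.append(vm_id)
        return ids

    def terminate(self, vm_ids):
        for vm_id in vm_ids:
            self.running.discard(vm_id)


def resize_cluster(provider, current_ids, target_size):
    """Cloud-independent logic: grow or shrink a cluster to target_size."""
    current = list(current_ids)
    if target_size > len(current):
        current.extend(provider.launch(target_size - len(current)))
    elif target_size < len(current):
        provider.terminate(current[target_size:])
        current = current[:target_size]
    return current
```

The point of the design is that `resize_cluster` never mentions a cloud by name, so the same logic runs unchanged against either provider.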
24. Challenge: not everything can be cloud agnostic
▪ Customers want to integrate with the standards of each cloud
▪ “Equivalent” cloud services have implementation quirks
25. Approach: abstraction layer for key integrations
▪ Fargate ← Kubernetes → AKS
▪ EC2 ≈ Azure Compute
▪ VPC ≈ VNet
▪ RDS MySQL/Postgres ≈ Azure Database for MySQL/Postgres
▪ ELB ≈ Azure Load Balancer
Abstraction layers for key integrations:
▪ Bring your own key encryption: KMS, Azure Key Vault
▪ AuthN / AuthZ / Identity: Okta, OneLogin, etc., IAM roles, Azure Active Directory
▪ Unified usage service: AWS Marketplace, Custom Billing, Azure Commerce Billing
▪ Databricks file system: S3 (with S3 commit service), Azure Storage
Services on top: API Servers, CM Master, CM Shards, Workers
26. Approach: harmonize “equivalent” cloud service quirks
Virtual machines
Promise of elastic compute is unevenly distributed
▪ Provisioning speed differs
▪ Deletion speed differs (speed to refill quota)
→ Need to adapt to cloud resource and API limits
Network
TCP connections are hard
▪ “Invisible” NATs have connection & timeout limits
→ Need tuned keep-alive and connection limit configs
▪ Kernel TCP SACK bug caused API hangs in one cloud only
→ Need deep robustness testing against both clouds (ex: poor NIC reliability)
Databases
When MySQL != MySQL
▪ Host OS matters
Ex: case sensitivity defaults
▪ Default DB params matter
Ex: tablespace config → 100x difference in recovery time
→ Need expertise in DB tuning to ensure equivalence
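The tuned keep-alive settings called for under the network quirks can be made concrete with a small sketch using Python's standard `socket` module. The timing values are placeholders, not figures from the talk; the right numbers depend on the idle timeout of whatever NAT or load balancer sits in the path. The per-connection timing options are not available on every operating system, so the sketch applies them only where the constants exist.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=15, probes=4):
    """Turn on TCP keep-alive and, where the OS exposes them, tune its timers.

    idle_s, interval_s, and probes are illustrative defaults (assumptions).
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # seconds idle before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection drops
    )
    for name, value in tuning:
        option = getattr(socket, name, None)  # absent on some platforms
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

Keeping the idle time below the NAT's connection timeout is the usual aim, so that the probes refresh the NAT entry before it is silently dropped.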
28. Inception: Improving a data platform with data & AI
We are one of our biggest customers
Challenge: Building a data platform is hard without a data platform
▪ Need data to track usage, maintain security
▪ Need data to observe and improve how users use the data platform
▪ Need data to keep the data platform up and running
Lesson: Data & AI can accelerate data platform features, product
analytics, and devops
29. How we use Databricks to accelerate itself
Key platform features
▪ Usage and billing reports
▪ Audit logs
Essential product analytics
▪ Feature usage, trends, prediction
▪ Growth and churn forecast, models
Mission critical devops
▪ Service KPIs and SLAs
▪ API and application structured logs
▪ Spark debug logs
30. Data foundation & analytics
Our distributed data pipelines
▪ 100s of TB logs per day
▪ Millions of time series per second
▪ Time-series, raw logs, request tracing, dashboards
▪ Kinesis, Event Hubs
▪ Declarative data pipeline deployments
▪ Real-time streaming
31. Takeaways
The architecture
Managing millions of VMs around the world in
multiple clouds
Challenges & lessons
The factory that builds and evolves the data
platform is more important than the data platform
itself
A cloud-agnostic platform that integrates with
cloud standards and quirks is the key to
multi-cloud
Data & AI accelerate data platform features,
product analytics, and devops
Join us!
http://databricks.com/careers
34. Our Product
Built around open source
Interactive data science, Scheduled jobs, SQL frontend
Data scientists, Data engineers, Business users
Cloud Storage, Compute Clusters, Databricks Runtime
Customer’s Cloud Account
Databricks Service