Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks

Jeff Pang
Principal Software Engineer @ Databricks

Who am I?
▪ Jeff Pang
Principal Software Engineer, Databricks
▪ Databricks Platform Engineering
To help data teams solve the world’s toughest problems,
the Databricks Platform team provides the world-class,
multi-cloud platform that enables us to expand fast and
iterate quickly
http://databricks.com/careers

About Databricks
▪ Founded in 2013 by the original creators of Apache Spark
▪ Data and AI platform as a service for 5000+ customers
▪ 1000+ employees, 200+ engineers, >$200M annual recurring revenue

Our product
Data scientists Data engineers Business users

Agenda
The architecture
Inside the Unified Analytics Platform
Challenges & lessons
Growing a SaaS data platform
Operating on multiple clouds
Accelerating a data platform with data & AI

The architecture
Inside the Unified Analytics Platform

Simple data engineering architecture
Diagram (one cluster):
▪ Sources: CSV, JSON, TXT…
▪ Data Lake: S3, HDFS, Blob Store, etc.
▪ Bronze: Raw Ingestion
▪ Silver: Filtered, Cleaned, Augmented
▪ Gold: Business-level Aggregates
▪ Consumers: Reporting, Analytics
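
The Bronze → Silver → Gold flow on this slide can be sketched in a few lines. This is only an illustration in plain Python, with in-memory lists standing in for data lake tables; the sample data, the `user` and `amount` fields, and all function names are hypothetical and do not come from the talk.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw input, standing in for a CSV file landing in the data lake.
RAW_CSV = """user,amount
alice,10
bob,
alice,5
bob,7
"""


def ingest_bronze(text):
    """Bronze: raw rows exactly as ingested (all strings, possibly incomplete)."""
    return list(csv.DictReader(io.StringIO(text)))


def refine_silver(bronze_rows):
    """Silver: drop incomplete rows and cast fields to proper types."""
    return [
        {"user": row["user"], "amount": int(row["amount"])}
        for row in bronze_rows
        if row["user"] and row["amount"]
    ]


def aggregate_gold(silver_rows):
    """Gold: business-level aggregate (total amount per user)."""
    totals = defaultdict(int)
    for row in silver_rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


bronze = ingest_bronze(RAW_CSV)
silver = refine_silver(bronze)
gold = aggregate_gold(silver)
```

Each stage only reads the output of the previous one, so a stage can be recomputed from the layer below it without re-ingesting the raw files.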

Modern data engineering architecture
Diagram (many clusters):
▪ Sources: CSV, JSON, TXT…, Kinesis
▪ Data Lake: Bronze, Silver, Gold
▪ Consumers: Reporting, Notebooks, AI, Streaming Analytics
▪ Platform services: Workflow scheduling, Cluster management

Multiply by thousands of customers...
Diagram:
▪ Customer Network (repeated per customer): Data Lake; CSV, JSON, TXT…; Kinesis
▪ control plane: Collaborative Notebooks, AI; Streaming Analytics; Workflow scheduling; Cluster management; Admin & Security; Reporting, Business Insights

→ millions of VMs managed per day

That’s the Databricks control plane
What did we learn from building a large-scale, multi-cloud data platform?
▪ 100,000s of users
▪ 100,000s of Spark clusters per day
▪ Millions of VMs launched per day
▪ Exabytes of data processed per day

Evolution of the Databricks control plane
We didn’t start with a global-scale, multi-cloud data platform
Challenge: Scaling a data platform from one customer to 5000+
Lesson: The factory that builds and evolves the data platform is more
important than the data platform itself

Fast time to market
Databricks control plane “in-a-box”
▪ Need to deliver value quickly
▪ Need to iterate quickly
▪ Can’t break things while iterating!
Keys to success:
▪ Modern CI
▪ Fast developer tools
▪ Testing, testing, testing
By the numbers:
▪ 25-500x Scala build speedups
▪ 10s of millions of tests per day
▪ 100s of Databricks “in-a-box” test envs per day

Expand the total addressable market
Replicating control planes quickly
▪ Need different configurations for different environments
▪ Need to update many environments
▪ Can’t slow down platform
development!
Keys to success:
▪ Declarative infrastructure
▪ Modern CD infrastructure
Diagram labels: jsonnet; 10 million lines; 250k lines
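
One way to picture "declarative infrastructure" for many environments is a single base definition plus small per-environment overrides, rendered into full configurations. The slide names jsonnet; the sketch below uses plain Python dictionaries instead, purely as an illustration. The environment names, keys, and values are hypothetical.

```python
import copy

# Hypothetical base definition shared by every environment.
BASE = {
    "replicas": 2,
    "region": None,
    "resources": {"cpu": "1", "memory": "2Gi"},
}

# Hypothetical per-environment overrides: only the differences are declared.
ENV_OVERRIDES = {
    "dev": {"region": "local", "replicas": 1},
    "prod-aws": {"region": "us-west-2", "resources": {"cpu": "4"}},
    "prod-azure": {"region": "westus2", "replicas": 3},
}


def deep_merge(base, override):
    """Return a new dict: override wins, nested dicts are merged recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render_all(base, overrides):
    """Render the full configuration for every environment."""
    return {env: deep_merge(base, env_override)
            for env, env_override in overrides.items()}


rendered = render_all(BASE, ENV_OVERRIDES)
```

Adding an environment then means adding one small override entry, while a change to `BASE` reaches every environment on the next render.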

Land and expand workloads
Scaling the control plane
▪ Need to support more users & workloads
▪ Need to build more features that scale
▪ Don’t want devs to reinvent the wheel!
Keys to success:
▪ A service framework to do the hard stuff
▪ Decompose monoliths to microservices
Service Framework: container & replica management, APIs & RPCs, rate limits, metrics, logging, secrets & security, ...
Diagram, version 1: Cloud VM API, Cluster Manager, Customer Clusters
Diagram, version 3: Cloud VM API, API Servers, CM Master, CM Shard, Workers, Customer Clusters
Diagram label: usage
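
The version 3 diagram splits the Cluster Manager into a master and shards. The slides do not say how clusters are assigned to shards, so the following is only an assumed, minimal sketch of one common approach: a stable hash of a cluster ID modulo the shard count. The function names and the cluster ID format are hypothetical.

```python
import hashlib


def shard_for(cluster_id: str, num_shards: int) -> int:
    """Map a cluster ID to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because hash() of a
    string changes between Python processes.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(cluster_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def assign(cluster_ids, num_shards):
    """Group cluster IDs by the shard that would manage them."""
    shards = {index: [] for index in range(num_shards)}
    for cluster_id in cluster_ids:
        shards[shard_for(cluster_id, num_shards)].append(cluster_id)
    return shards


clusters = [f"cluster-{n}" for n in range(100)]
assignment = assign(clusters, num_shards=4)
```

A plain modulo scheme reshuffles most clusters when the shard count changes; a production system would more likely need consistent hashing or an explicit assignment table.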

The Databricks data platform factory
Stack shown in the diagram:
▪ Data Platform, serving many Customer Networks
▪ Services: API Server, CM Master, CM Shard, Workers
▪ Envoy, GraphQL
▪ HCVault, Consul, Prometheus, ELK, Jaeger, Grafana, common IAM, onboarding, billing, ...
▪ Kubernetes
▪ Cloud VMs, network, storage, databases

Why multi-cloud?
The data platform needs to be where the data is
▪ Performance, latency, egress data costs
▪ Cloud-specific integrations
▪ Data governance policies
Challenge: Supporting multiple clouds without sacrificing dev velocity
Lesson: A cloud-agnostic layer is key to dev velocity, but it also needs to
integrate with the standards of each cloud and deal with their quirks

Challenge: dev velocity on multiple clouds
Services?
Many cloud services have no direct equivalents
▪ DynamoDB vs ?
▪ CosmosDB vs ?
▪ Aurora vs ?
▪ SQL DW vs ?
APIs?
Cloud APIs don’t look like each other
▪ SDK: no common interfaces
▪ Auth: IAM vs AAD
▪ ACLs: IAM vs Azure RBAC
Ops?
Operational tools for each cloud are very different
▪ Templates: CloudFormation vs ARM templates
▪ Logs: CloudWatch vs Azure Monitor

Approach: cloud agnostic dev framework
Use lowest common denominator cloud services
Diagram:
▪ Services on the service framework API: API Server, CM Master, CM Shard, Workers
▪ Shared tooling: HCVault, Consul, Prometheus, ELK, Jaeger, Grafana, common IAM, onboarding, billing, ...; Envoy
▪ Kubernetes: EKS ≈ AKS
▪ EC2 ≈ Azure Compute
▪ VPC ≈ VNet
▪ RDS MySQL/Postgres ≈ Azure Database for MySQL/Postgres
▪ ELB ≈ Azure Load Balancer
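
The idea of a cloud-agnostic layer over "≈ equivalent" services can be sketched as one neutral interface with a thin adapter per cloud. This is an illustrative sketch only, not the Databricks service framework: the class names, the neutral size names, and the size-to-instance-type mapping are assumptions, and no real cloud SDK is called.

```python
from abc import ABC, abstractmethod


class CloudProvider(ABC):
    """Cloud-neutral interface that service code programs against."""

    name = "unknown"

    @abstractmethod
    def instance_type(self, size):
        """Translate a neutral size ("small", "large") to a cloud-specific type."""

    @abstractmethod
    def network_kind(self):
        """Name of the cloud's virtual network primitive."""

    def describe_vm(self, size):
        """Build a cloud-specific VM description from a neutral request."""
        return {
            "cloud": self.name,
            "instance_type": self.instance_type(size),
            "network": self.network_kind(),
        }


class Aws(CloudProvider):
    name = "aws"
    _SIZES = {"small": "m5.large", "large": "m5.4xlarge"}  # illustrative mapping

    def instance_type(self, size):
        return self._SIZES[size]

    def network_kind(self):
        return "VPC"


class Azure(CloudProvider):
    name = "azure"
    _SIZES = {"small": "Standard_D2s_v3", "large": "Standard_D16s_v3"}  # illustrative

    def instance_type(self, size):
        return self._SIZES[size]

    def network_kind(self):
        return "VNet"


def launch_plan(provider, size, count):
    """Service code: identical for every cloud, only the provider differs."""
    return [provider.describe_vm(size) for _ in range(count)]
```

For example, `launch_plan(Aws(), "small", 2)` and `launch_plan(Azure(), "small", 2)` run the same service code and differ only in the adapter passed in.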

Challenge: not everything can be cloud agnostic
▪ Customers want to integrate with the standards of each cloud
▪ “Equivalent” cloud services have implementation quirks

Approach: abstraction layer for key integrations
Diagram, cloud-agnostic base:
▪ Services: API Server, CM Master, CM Shard, Workers
▪ Kubernetes: Fargate ≈ AKS
▪ EC2 ≈ Azure Compute
▪ VPC ≈ VNet
▪ RDS MySQL/Postgres ≈ Azure Database for MySQL/Postgres
▪ ELB ≈ Azure Load Balancer
Diagram, abstraction layers over cloud-specific integrations:
▪ AuthN / AuthZ / Identity: Okta, OneLogin, etc.; IAM roles; Azure Active Directory
▪ Bring your own key encryption: KMS; Azure Key Vault
▪ Unified usage service: AWS Marketplace, Custom Billing; Azure Commerce Billing
▪ Databricks file system: S3 (with S3 commit service); Azure Storage

Approach: harmonize “equivalent” cloud service quirks
Virtual machines
Promise of elastic compute is unevenly distributed
▪ Provisioning speed differs
▪ Deletion speed differs (speed to refill quota)
→ Need to adapt to cloud resource and API limits
Network
TCP connections are hard
▪ “Invisible” NATs have connection & timeout limits
→ Need tuned keep alive, connection limit configs
▪ Kernel TCP SACK bug caused API hangs in one cloud only
→ Need deep robustness testing against both clouds (ex: poor NIC reliability)
Databases
When MySQL != MySQL
▪ Host OS matters
Ex: case sensitivity defaults
▪ Default DB params matter
Ex: tablespace config → 100x difference in recovery time
→ Need expertise in DB tuning to ensure equivalence
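
"Need to adapt to cloud resource and API limits" is often handled with retries and exponential backoff. The talk does not describe the mechanism Databricks uses, so this is only a generic sketch under that assumption. `RateLimitedError`, `call_with_backoff`, and the default delays are hypothetical; `sleep` and `rng` are injectable so the behavior can be checked without waiting.

```python
import random
import time


class RateLimitedError(Exception):
    """Hypothetical error raised when a cloud API rejects a call due to limits."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on RateLimitedError with exponential backoff.

    The delay doubles after each failed attempt, is capped at max_delay, and
    is scaled by a jitter factor between 0.5 and 1.0 so that many clients do
    not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * (0.5 + rng() / 2))


def make_flaky(failures):
    """Build an operation that fails `failures` times, then returns "ok"."""
    state = {"remaining": failures}

    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RateLimitedError("quota exceeded")
        return "ok"

    return operation
```

Because limits and refill speeds differ per cloud, `base_delay` and `max_delay` would likely need separate tuning for each provider.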

Accelerating a data platform
with data & AI

Inception: Improving a data platform with data & AI
We are one of our biggest customers
Challenge: Building a data platform is hard without a data platform
▪ Need data to track usage, maintain security
▪ Need data to observe and improve how users use the data platform
▪ Need data to keep the data platform up and running
Lesson: Data & AI can accelerate data platform features, product
analytics, and devops

How we use Databricks to accelerate itself
Key platform features
▪ Usage and billing reports
▪ Audit logs
Essential product analytics
▪ Feature usage, trends, prediction
▪ Growth and churn forecast, models
Mission critical devops
▪ Service KPIs and SLAs
▪ API and application structured logs
▪ Spark debug logs
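
As a small illustration of turning structured API logs into a service KPI, the sketch below computes per-service availability from JSON log lines. The log schema (`service`, `status`) and the rule that any status below 500 counts as a success are assumptions made for the example, not details from the talk.

```python
import json

# Hypothetical structured log lines, one JSON object per request.
LOG_LINES = [
    '{"service": "api", "status": 200}',
    '{"service": "api", "status": 503}',
    '{"service": "api", "status": 404}',
    '{"service": "api", "status": 200}',
    '{"service": "cluster-manager", "status": 200}',
    '{"service": "cluster-manager", "status": 201}',
]


def availability(lines):
    """Return the fraction of non-5xx responses per service."""
    totals = {}
    successes = {}
    for line in lines:
        event = json.loads(line)
        service = event["service"]
        totals[service] = totals.get(service, 0) + 1
        if event["status"] < 500:
            successes[service] = successes.get(service, 0) + 1
    return {service: successes.get(service, 0) / totals[service]
            for service in totals}


kpis = availability(LOG_LINES)
```

At the volumes quoted in this deck, the same aggregation would run as a distributed or streaming job rather than a loop over a list.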

Data foundation & analytics
Our distributed data pipelines
▪ 100s of TB logs per day
▪ Millions of time series per second
▪ Time-series, raw logs, request tracing, dashboards
▪ Sources: Kinesis, Event Hubs
▪ Declarative data pipeline deployments
▪ Real-time streaming
Takeaways
The architecture
▪ Managing millions of VMs around the world in multiple clouds
Challenges & lessons
▪ The factory that builds and evolves the data platform is more important than the data platform itself
▪ A cloud-agnostic platform that integrates with cloud standards and quirks is the key to multi-cloud
▪ Data & AI accelerate data platform features, product analytics, and devops
Join us!
http://databricks.com/careers

Our Product
Built around open source
▪ Users: Data scientists, Data engineers, Business users
▪ Databricks Service: Interactive data science, Scheduled jobs, SQL frontend
▪ Customer’s Cloud Account: Databricks Runtime, Compute Clusters, Cloud Storage

Databricks simplifies data and AI
so data teams can innovate faster
