Workday uses Apache Spark as the foundational technology for its Prism Analytics product. It has developed a custom Spark upgrade model to handle upgrading Spark across its multi-tenant environment. Workday also collects runtime metrics on Spark SQL queries using a custom metrics pipeline and REST API. Future plans include upgrading to Spark 3.x and improving multi-tenancy support through a "Multiverse" deployment model.
2. • What is Workday?
• "Power of One" and Prism Analytics
• How does Apache Spark fit in?
• Custom Spark Upgrade Model
• Runtime Metrics Pipeline
• What's next?
Agenda
3. • FY20 Revenue $3.6B
• ~28% Y/Y Growth
• >7,700 customers
• >45% of Fortune 500
• >12,300 employees
• NASDAQ: WDAY
About Workday
Enterprise Business Applications for a Changing World
• Human Capital, Financials, Planning, Analytics
• Cloud native, multi-tenant
• 30% of revenue re-invested in product each year
• >40 Advisory Partners
• >200 Software Partners
Products: Planning | Financial Management | Human Capital Management | Analytics & Benchmarking
6. One Platform: Durable, Extensible, Metadata-driven Object Data Model
Platform components: Business Process Framework | Object Data Model | Reporting and Analytics | Machine Learning | Security | Integration | Cloud
One Source for Data | One Security Model | One Experience | One Community
7. Trust: Security, Encryption, Privacy and Compliance, layered on the same One Platform components
8. Reporting and Analytics: Descriptive, Exploratory, Augmented, layered on the same One Platform components
9. The Leading Enterprise Cloud for Finance and HR
• 37 Million+ workers
• 100 Billion+ transactions per year
• 96.1% of transactions < 1 second
• 99.9% actual availability
• 200+ companies
• #1 Future 50, Fortune
• #2 40 Best Workplaces in Technology, Fortune
• 10 Thousand+ certified resources in the ecosystem
11. Workday Maintains Your Data Gravity
Data sources flowing into Workday: Financials (GL), Employees (HR & Payroll), Third-Party HR & FIN, Industry & Homegrown, CRM, Marketing, Service, Subsidiaries, Contract Labor
12. Workday Prism Analytics
The full spectrum of workforce, financial, and operational insights, all within Workday, combining Workday data and non-Workday data.
16. Workday in the Cloud
[Data center map: production (PROD), non-production (NPRD), and disaster-recovery (DR) sites across ASH, PDX, ATL, DUB, AMS, ORE, MTL, and COL, plus ENG and SALES environments]
19. With this scale, complexity, and these dependencies…
How can you do Spark version upgrades?
20. Spark upgrade challenges:
‒ high number of tenants,
‒ long-running Spark applications,
‒ progressive roll-out,
‒ rollback support,
‒ maintaining a custom Spark fork
Custom Spark Upgrade Model
Previous Approach: Spark single-version support against a single repo (Custom Repo → one Spark version).
New Approach: Spark multi-version support against a single repo (Custom Repo → Shim API → Spark current version / Spark next version).
This upgrade model is not specific to Spark upgrades, so it can be applied to any internal or external API upgrade facing these kinds of challenges. It is used for both major and minor Spark version upgrades.
21. • Remove PII Data from Logs: obfuscation of Spark query plans and DataFrame schemas.
• Catalyst Optimizer: additional optimization rules for aggregations and large CASE statements.
• Extension for Physical Plan: enables correlation between physical operators and their runtime metrics.
• Rest APIs: SQL Rest API improvements to query and aggregate physical-operator-level metrics.
• Benchmark Module: additional module to run benchmark tests on newly introduced Spark patches using standard TPC-H and custom queries.
Custom Spark Release Preparation
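PII removal from logged query plans can be sketched as string redaction. This is a minimal illustration in Python (the patterns and function name are assumptions for this sketch, not Workday's actual obfuscation rules, which run inside Spark itself):

```python
import re

# Illustrative sketch: mask literal filter values and email addresses
# before a query plan string is written to application logs.
LITERAL = re.compile(r"(=\s*)('[^']*'|\d[\d.]*)")  # values in filter predicates
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def obfuscate_plan(plan: str) -> str:
    """Redact potentially PII-bearing literals from a plan string."""
    plan = EMAIL.sub("[REDACTED_EMAIL]", plan)
    return LITERAL.sub(r"\1[REDACTED]", plan)
```

The same idea extends to DataFrame schema obfuscation: column names that match sensitive patterns would be masked before logging.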
22. Shim API
SparkShim Interface
‒ SparkShimImpl for Spark v2.3.0
‒ SparkShimImpl for Spark v2.4.4
Spark API diffs between the two versions may introduce both compile-time issues (e.g., invalid types) and runtime issues (e.g., NoSuchMethodError).
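The shim pattern above can be sketched as an interface with one implementation per Spark version, selected at runtime by a feature toggle. This is illustrative Python (the real implementation is JVM/Scala, and the method names and toggle variable here are assumptions):

```python
import os
from abc import ABC, abstractmethod

class SparkShim(ABC):
    """Version-agnostic interface; callers never touch version-specific APIs."""
    @abstractmethod
    def spark_version(self) -> str: ...
    @abstractmethod
    def parse_schema(self, ddl: str) -> list: ...

class SparkShim230(SparkShim):
    def spark_version(self) -> str:
        return "2.3.0"
    def parse_schema(self, ddl: str) -> list:
        # Hypothetical 2.3-era behavior: comma-separated fields only.
        return [c.strip() for c in ddl.split(",")]

class SparkShim244(SparkShim):
    def spark_version(self) -> str:
        return "2.4.4"
    def parse_schema(self, ddl: str) -> list:
        # Hypothetical 2.4-era behavior: also tolerates semicolons.
        return [c.strip() for c in ddl.replace(";", ",").split(",")]

def load_shim() -> SparkShim:
    """A feature toggle picks the runtime implementation."""
    toggle = os.environ.get("SPARK_VERSION_TOGGLE", "current")
    return SparkShim244() if toggle == "next" else SparkShim230()
```

Application code depends only on `SparkShim`, so a version upgrade becomes adding one new implementation rather than rewriting call sites.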
23. Compile-time & Runtime Version Selection
Compile-Time classpaths (compileClasspath + testCompileClasspath): the Spark compile-time version is the current version.
Runtime classpaths (runtimeClasspath + testRuntimeClasspath): the Spark runtime version is selected by feature toggle as the current or next version.
A sample Gradle build script snippet selects both the Spark and Shim compile-time and runtime classpath versions according to these classpath types.
A feature toggle is used to select the Spark version at:
- build time (runtime version selection for the classpath),
- test pipelines (to run unit, integration, and perf tests per Spark version),
- environment level (to enable a Spark version per environment: test, preprod, or prod).
Shim API artifacts are shipped in addition to Spark artifacts (by version).
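The compile-vs-runtime split can be sketched as follows (Python pseudocode standing in for the Gradle logic; the artifact coordinates and toggle values are illustrative assumptions):

```python
# Illustrative sketch, not Workday's actual build script: compile always
# targets the current Spark version so the code base builds against one
# stable API, while a feature toggle chooses the runtime version.
CURRENT, NEXT = "2.3.0", "2.4.4"

def resolve_classpaths(toggle: str) -> dict:
    runtime = NEXT if toggle == "next" else CURRENT
    return {
        # compileClasspath + testCompileClasspath: pinned to current.
        "compileClasspath": f"org.apache.spark:spark-sql_2.11:{CURRENT}",
        # runtimeClasspath + testRuntimeClasspath: follows the toggle, so
        # the same artifact runs against either Spark version.
        "runtimeClasspath": f"org.apache.spark:spark-sql_2.11:{runtime}",
    }
```

This is what makes progressive roll-out and rollback cheap: flipping the toggle swaps the runtime Spark version without rebuilding the application.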
24. Verification, Progressive Roll-out & Cleanup
Verification Phase
Verify the following test pipelines against both Spark versions:
• Automated Regression Testing: running unit & integration test pipelines
• Performance Testing:
‒ Spark Benchmark Pipeline: Spark current vs. new version perf tests (executing standard TPC-H and custom queries) + Hadoop
‒ End2End Perf Pipeline: custom applications + Spark + Hadoop
Progressive Roll-out Phase
WAVE I: Single Tenant (Internal), Duration: 2 Weeks
WAVE II: Multiple Tenants (Impl / NonProd), Duration: 2 Weeks
WAVE III: All Tenants (Internal/Impl/Prod), Duration: 4 Weeks
Cleanup Phase
Remove the previous Spark version (fork, artifacts from the artifactory / mvn repository) and the Shim API.
25. Spark SQL Engine: Query Planning & Execution
Input: SQL / Dataset / DataFrame
Logical Planning (Analysis, Optimizations): Unresolved Logical Plan → Logical Plan → Optimized Logical Plan
Physical Planning (Physical Plan Generation): Physical Plans → Cost Model → Selected Physical Plan
Execution: DAG Execution (Application → Job → Stage → Task), emitting SQL Metrics
SQL Metrics are exposed via the Spark UI, Rest APIs, and Event Logs.
27. New Spark SQL Rest API [coming with v3.1.0]
New SQL Rest Endpoints
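The new SQL endpoints hang off the standard Spark REST API under `/api/v1/applications/[app-id]/sql` (added by SPARK-31440). A minimal client sketch, assuming a reachable history server and the `details` query parameter as documented:

```python
import json
import urllib.request

def sql_api_url(base: str, app_id: str, execution_id=None) -> str:
    """Build a Spark SQL Rest API URL (endpoints added by SPARK-31440)."""
    url = f"{base}/api/v1/applications/{app_id}/sql"
    if execution_id is not None:
        url += f"/{execution_id}"
    return url + "?details=true"

def fetch(url: str):
    # e.g. against a Spark history server at http://localhost:18080
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

With `details=true`, each execution's JSON includes its physical operators and their metrics, which is what enables the per-operator aggregations on the following slides.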
28. Comparison of new Spark SQL Rest API JSON Outputs
Older Version (cherry-picked from OSS) vs. Improved Version
Improvements:
1. Correlation between physical operators and their runtime metrics
2. wholeStageCodegenId support across multiple physical operators
3. Normalization of metric values to enable aggregations
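Improvement 3 matters because raw Spark metric values are human-readable strings. A sketch of the normalization idea (the input formats here are assumptions based on typical Spark UI metric strings, not the exact deck implementation):

```python
import re

# Turn a display value like "1.5 GB" into a plain number of bytes so
# values can be summed and averaged across operators and tenants.
SIZE_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

def normalize_size(value: str) -> float:
    m = re.fullmatch(r"\s*([\d.]+)\s*([KMGT]?B)\s*", value)
    if not m:
        raise ValueError(f"unrecognized size: {value!r}")
    return float(m.group(1)) * SIZE_UNITS[m.group(2)]
```

Time metrics ("2.0 ms", "1.3 min") would get the same treatment with a unit table in milliseconds.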
29. Sample Queries on Spark SQL Metrics
File Scan Operation:
• What is the total number of loaded input/output rows by file type, tenant, application, date?
• What is the number of files by file type, tenant, application, date?
• What is the total scan time and total metadata time by min/med/max, file type, tenant, application, date?
• ...
Join:
• What is the total number of joins, BroadcastHashJoin or SortMergeJoin, across all tenants by day?
• What are the top 25 tenants having max broadcasted data size (GB)?
• What are the top 25 tenants having max time to collect during broadcast (minutes)?
• What are the top 25 tenants having max time to broadcast or to build during broadcast (minutes)?
• ...
General:
• What are the top 25 tenants running Join, Filter, Sort (etc.) operations?
• What are the most used operations by tenant, application, date?
• What is the total number of operations by tenant, application, date?
• ...
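Once per-operator metrics are normalized, these questions reduce to group-by aggregations. A sketch of one of them, "top N tenants by broadcasted data size" (the record field names here are assumptions, not the pipeline's actual schema):

```python
from collections import defaultdict

# Hypothetical normalized metric records, one per physical operator.
records = [
    {"tenant": "t1", "operator": "BroadcastExchange", "dataSizeBytes": 5_000_000},
    {"tenant": "t2", "operator": "BroadcastExchange", "dataSizeBytes": 12_000_000},
    {"tenant": "t1", "operator": "BroadcastExchange", "dataSizeBytes": 9_000_000},
]

def top_tenants_by_broadcast_size(rows, n=25):
    """Sum broadcasted bytes per tenant and return the top n tenants."""
    totals = defaultdict(int)
    for r in rows:
        if r["operator"] == "BroadcastExchange":
            totals[r["tenant"]] += r["dataSizeBytes"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

The other sample queries follow the same shape with different operator filters and grouping keys (file type, application, date).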
30. Correlation between Physical Operators & SQL Metrics
• We also integrated our physical plans with runtime SQL metrics.
• We can correlate physical operators with their runtime metrics from application logs for troubleshooting and debugging purposes.
These patches were also contributed back to the OSS repo for community usage:
•[SPARK-31440][SQL] Improve SQL Rest API
https://github.com/apache/spark/pull/28208
•[SPARK-32548][SQL] - Add Application attemptId support to SQL Rest API
https://github.com/apache/spark/pull/29364
•[SPARK-31566][SQL][DOCS] Add SQL Rest API Documentation
https://github.com/apache/spark/pull/28354
Backported Patches to Spark OSS Repo [v3.1.0]
32. Spark 3.0 introduced the following features:
‒ Adaptive Query Execution (SPARK-31412)
‒ Dynamic Partition Pruning (SPARK-11150)
‒ Scala 2.12 Support (SPARK-26132)
‒ JDK 11 Support (SPARK-24417)
‒ Hadoop 3 Support (SPARK-23534)
• Spark 3.x Upgrade (+ Scala, JDK, Hadoop)
• Performance, Troubleshooting and Debugging Improvements
• Multi-Tenancy Support
What's next?