Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Pavel Hardak – Eren Avsarogullari
•What is Workday?
•“Power of One” and Prism Analytics
•How does Apache Spark fit in?
•Custom Spark Upgrade Model
•Runtime Metrics Pipeline
•What is next?
Agenda
• FY20 Revenue $3.6B
• ~28% Y/Y Growth
• >7,700 customers
• >45% of Fortune 500
• >12,300 employees
• NASDAQ: WDAY
About Workday
Enterprise Business Applications for a Changing World
• Human Capital, Financials, Planning,
Analytics
• Cloud native, multi-tenant
• 30% revenue re-invested in product
each year
• >40 Advisory Partners
• >200 Software Partners
Planning
Financial
Management
Human
Capital Management
Analytics & Benchmarking
Planning
Financial
Management
Human Capital
Management
Analytics
Business Process
Framework
Object
Data Model
Reporting and
Analytics
Security Integration
Cloud
One Source for Data | One Security Model | One Experience | One Community
Machine
Learning
One Platform
Durable
Object Data Model
Metadata
Extensible
Security
Encryption Privacy and
Compliance
Trust
Reporting and Analytics
Exploratory
Descriptive
Augmented
The Leading Enterprise Cloud for Finance and HR
37 Million +
workers
100 Billion +
transactions per year
96.1%
transactions < 1
second
99.9%
actual availability
200+
companies
#1
Future 50, Fortune
#2
40 Best Workplaces in
Technology, Fortune
10 Thousand +
certified resources
in the ecosystem
Planning
Financial
Management
Human Capital
Management
Analytics
Financial Employees
GL HR &
Payroll
Third-Party
HR & FIN
Industry &
Homegrown
CRM Marketing Service Subsidiaries Contract
Labor
Workday Maintains Your Data Gravity
Workday Prism Analytics
The full spectrum of workforce,
financial, and operational
insights, all within Workday.
Workday
Data
Non-Workday
Data
Prism Analytics Momentum - 100% YoY growth
Workday Confidential
Over 500 Prism Analytics Customers
Table
Ingestion
Data Prep
Examples
Engine
Lens Build
Engine
Query Engine
and Mercury
Workday Spark Runtime Engine
Compute (YARN) and Storage (HDFS/S3)
Prism UI and APIs
Accounting
Center
People
Analytics
DBFR Analytics Platform Cosmos DD4A
Apache Spark as foundational technology
HDFS / S3
Prism 01
Tenant 01
Prism 02
Tenant 02
Prism 03
Tenant 03
Prism 04
Tenant 04
Spark Cluster Spark Cluster Spark Cluster Spark Cluster
Prism Tenants - Deployment (simplified)
Workday in the Cloud
ASH
PDX
ATL
PROD & NPRD
ENG
PROD & NPRD
DR for PDX
SALES
DR for ASH
PROD & NPRD
DUB
AMS
DR for DUB
ORE
MTL
PROD & NPRD
NPRD
COL
PROD
Prism
Prism
Prism
Prism
HDFS / S3
Spark
Driver
Data Prep
Interactive
Spark
Driver
Spark
Executor
Spark
Executor
Spark
Driver Lens Build
Phase 1
Lens Build
Phase 2
YARN
ADS
Spark
Executor
Query
Engine
Prism-enabled Tenant - Today
Workday Spark = Apache Spark ++
Apache Spark
Autonomous
Operational
Stability
Core
Stability
Complex Application logic
as Spark Plans
Performance & Scalability for
batch processing
Serviceability
Multi-tenancy
Ingest Latency
Interactive Query
Performance
With this scale, complexity, dependencies…
How can you do Spark version upgrades?
Spark Upgrade challenges:
‒ high number of tenants,
‒ long-running Spark Applications,
‒ progressive roll-out,
‒ rollback case,
‒ maintaining custom Spark fork
Custom Spark Upgrade Model
Custom
Repo
Spark
Version
Custom
Repo
Spark Current
Version
Shim API
Spark Next
Version
Previous Approach
New Approach
Spark single-version support against a single repo
Spark multi-version support against a single repo
This upgrade model is not specific to Spark, so it can be
applied to any internal or external API upgrade that faces
these kinds of challenges.
This upgrade model is also
used for major and minor
Spark version upgrades.
•Remove PII Data from Logs: Spark query plans and DataFrame schema
obfuscation.
•Catalyst Optimizer: Additional optimization rules for aggregations and large
CASE statements.
•Extension for Physical Plan: Enable correlation between Physical Operators
and their runtime metrics.
•Rest APIs: SQL Rest API improvements to query and aggregate physical
operator-level metrics.
•Benchmark Module: Additional module to run benchmark tests on newly
introduced Spark patches, using standard TPC-H and custom queries.
Custom Spark Release Preparation
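As a rough illustration of the first item (keeping PII out of logged query plans), the sketch below masks column names in a textual plan. It is written in Python for brevity and is not Workday's implementation: the function names, the `col_` prefix, and the truncated SHA-256 token are all assumptions made for this example.

```python
import hashlib
import re


def obfuscate_name(name: str) -> str:
    """Map a column name to a stable, non-reversible token (hypothetical scheme)."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:8]
    return f"col_{digest}"


def obfuscate_plan(plan_text: str, column_names: list) -> str:
    """Mask every known column name that appears in a textual query plan."""
    # Longest names first, so a short name never clobbers part of a longer one.
    for name in sorted(column_names, key=len, reverse=True):
        pattern = rf"\b{re.escape(name)}\b"
        plan_text = re.sub(pattern, obfuscate_name(name), plan_text)
    return plan_text


# Made-up plan text, loosely shaped like Spark's explain output.
plan = "Filter (salary#12 > 100000)\n+- Scan parquet [employee_name#11, salary#12]"
masked = obfuscate_plan(plan, ["employee_name", "salary"])
print(masked)
```

Because the token is derived deterministically from the name, the same column maps to the same token everywhere, so masked plans remain comparable across log lines.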
Shim API
SparkShim
Interface
SparkShimImpl
for Spark v2.3.0
SparkShimImpl
for Spark v2.4.4
Spark API differences between
the two versions may introduce
compile-time issues (e.g. invalid type) and/or
runtime issues (e.g. NoSuchMethodError)
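The shim idea can be sketched as one version-neutral interface with one implementation per Spark version, so application code never touches a version-specific API directly. The sketch is in Python for brevity (the real shim is JVM code); the method `explain_plan`, the class names `SparkShimImpl230` / `SparkShimImpl244`, and the `load_shim` registry are hypothetical and only mirror the boxes on the slide.

```python
from abc import ABC, abstractmethod


class SparkShim(ABC):
    """Version-neutral interface that application code depends on."""

    @abstractmethod
    def spark_version(self) -> str:
        ...

    @abstractmethod
    def explain_plan(self, query: str) -> str:
        """Hypothetical operation whose underlying Spark API differs by version."""


class SparkShimImpl230(SparkShim):
    def spark_version(self) -> str:
        return "2.3.0"

    def explain_plan(self, query: str) -> str:
        # Stand-in for a call that only compiles against the 2.3.0 API.
        return f"[2.3.0] {query}"


class SparkShimImpl244(SparkShim):
    def spark_version(self) -> str:
        return "2.4.4"

    def explain_plan(self, query: str) -> str:
        # Stand-in for the equivalent call in the 2.4.4 API.
        return f"[2.4.4] {query}"


# One implementation is registered per supported Spark version.
_SHIMS = {"2.3.0": SparkShimImpl230, "2.4.4": SparkShimImpl244}


def load_shim(version: str) -> SparkShim:
    """Pick the implementation that matches the Spark version on the classpath."""
    if version not in _SHIMS:
        raise ValueError(f"no shim registered for Spark {version}")
    return _SHIMS[version]()


shim = load_shim("2.4.4")
print(shim.spark_version(), shim.explain_plan("SELECT 1"))
```

Callers only ever see `SparkShim`, so adding the next Spark version means adding one implementation and one registry entry rather than touching application code.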
Compile-time & Runtime Version Selections
Classpath | Types | Description
Compile-time | compileClasspath + testCompileClasspath | Spark compile-time version is the current version.
Runtime | runtimeClasspath + testRuntimeClasspath | Spark runtime version is selected by feature toggle as current or next version.
A sample Gradle build script code snippet on selections of
both Spark and Shim compile-time and runtime classpath versions:
Selected Spark versions by classpath types:
A feature toggle is used to select the Spark version at:
- Build Time (runtime version selection for classpath)
- Test Pipelines (to run UT, IT and Perf Tests by Spark version)
- Environment (to enable Spark version at env level – test, preprod or prod)
Shim API artifacts are shipped in addition to Spark artifacts (by version)
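The selection rule on this slide (compile against the current version, let a toggle choose the runtime version) can be modelled in a few lines. This is a Python model of the rule, not the Gradle build script itself; the constants assume 2.3.0 is "current" and 2.4.4 is "next", and `resolve_spark_versions` is a hypothetical name.

```python
# Assumed versions for illustration, taken from the shim slide.
CURRENT_VERSION = "2.3.0"
NEXT_VERSION = "2.4.4"

COMPILE_CLASSPATHS = ("compileClasspath", "testCompileClasspath")
RUNTIME_CLASSPATHS = ("runtimeClasspath", "testRuntimeClasspath")


def resolve_spark_versions(use_next_at_runtime: bool) -> dict:
    """Return the Spark version to place on each classpath type."""
    runtime_version = NEXT_VERSION if use_next_at_runtime else CURRENT_VERSION
    # Compile-time classpaths always use the current version.
    versions = {name: CURRENT_VERSION for name in COMPILE_CLASSPATHS}
    # Runtime classpaths follow the feature toggle.
    versions.update({name: runtime_version for name in RUNTIME_CLASSPATHS})
    return versions


print(resolve_spark_versions(use_next_at_runtime=True))
```

Keeping the compile-time version fixed means any use of an API that exists only in the next version fails at build time, while the toggle lets the same build be tested against either runtime.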
Verification & Progressive Roll-out & Cleanup
Progressive Roll-out Phase
WAVE III
Scope: All Tenants (Internal/Impl/Prod)
Duration: 4 Weeks
WAVE II
Scope: Multiple Tenants (Impl / NonProd)
Duration: 2 Weeks
WAVE I
Scope: Single Tenant (Internal)
Duration: 2 Weeks
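A small sketch of the wave schedule above, assuming the waves run back to back in the order I, II, III (the slide gives scopes and durations but not the exact sequencing). `Wave` and `active_wave` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Wave:
    name: str
    scope: str
    weeks: int


# Scopes and durations as listed on the slide.
WAVES = [
    Wave("WAVE I", "Single Tenant (Internal)", 2),
    Wave("WAVE II", "Multiple Tenants (Impl / NonProd)", 2),
    Wave("WAVE III", "All Tenants (Internal/Impl/Prod)", 4),
]


def active_wave(week: int) -> Optional[Wave]:
    """Return the wave in progress during a 1-based week, or None once roll-out is done."""
    if week < 1:
        raise ValueError("week is 1-based")
    end_week = 0
    for wave in WAVES:
        end_week += wave.weeks
        if week <= end_week:
            return wave
    return None


for week in (1, 3, 8, 9):
    wave = active_wave(week)
    print(week, wave.name if wave else "roll-out complete")
```

Under that assumption the full roll-out takes 8 weeks, with half of the time spent in the final, all-tenant wave.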
Verification Phase
Verify the following test pipelines against both Spark
versions:
• Automated Regression Testing: Running Unit &
Integration Test Pipelines
• Performance Testing:
‒ Spark Benchmark Pipeline: Spark current vs
new version Perf Tests (by executing standard
TPC-H and custom queries) + Hadoop
‒ End2End Perf Pipeline: Custom applications +
Spark + Hadoop
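One way a current-vs-new benchmark comparison might flag regressions is sketched below. The 10% tolerance, the query names, and the timings are made-up illustration values, and `find_regressions` is a hypothetical helper rather than part of the pipeline described here.

```python
def find_regressions(current: dict, candidate: dict, tolerance: float = 0.10) -> dict:
    """Return queries that got slower on the candidate version by more than `tolerance`.

    Both inputs map a query name to its runtime in seconds; the result maps
    each regressed query to its relative slowdown (0.2 means 20% slower).
    """
    regressions = {}
    for query, baseline in current.items():
        if query not in candidate or baseline <= 0:
            continue  # nothing to compare against
        slowdown = (candidate[query] - baseline) / baseline
        if slowdown > tolerance:
            regressions[query] = round(slowdown, 3)
    return regressions


# Illustrative timings only, not measured results.
current_runs = {"tpch_q1": 12.0, "tpch_q5": 30.0, "custom_join": 8.0}
next_runs = {"tpch_q1": 11.5, "tpch_q5": 36.0, "custom_join": 8.4}
print(find_regressions(current_runs, next_runs))
```

A relative threshold keeps short and long queries comparable; an absolute threshold would over-report long queries and hide regressions in short ones.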
Previous Spark version:
‒ Fork,
‒ Artifacts from Artifactory /
mvn repository
Shim API
Cleanup Phase
Spark SQL Engine - Query Planning & Execution
SQL
Dataset
DataFrame
Unresolved
Logical Plan
Logical
Plan
Optimized
Logical Plan Physical
Plan
CostModel
Selected
Physical
Plan
DAG
Execution
SQL
Metrics
Application Job Stage Task
Spark UI Rest APIs Event Logs
Logical Planning Physical Planning Execution
Analysis Optimizations Physical Plans
Generation
Runtime Metrics Pipeline Architecture
Proton
(Application Server)
Data
Acquisition
Data
Preparation
Query
Engine
HDFS / S3
Spark History
Server Data Warehouse
Stats
App
Hadoop Cluster
Spark
Applications
Spark Hive Tables
• app_metrics
• job_metrics
• stage_metrics
• task_metrics
• executor_metrics
• sql_metrics
Spark Rest APIs
• Application
• Job
• Stage
• Task
• Executors
• SQL (New)
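The REST resources listed above live under `/api/v1/applications/[app-id]` on the Spark History Server. The helper below only builds the URLs and makes no network call; the host name and application id are placeholders, and `rest_urls` is a hypothetical name. Task lists are not included because they are addressed per stage attempt rather than per application.

```python
from urllib.parse import quote

API_ROOT = "/api/v1"


def rest_urls(history_server: str, app_id: str) -> dict:
    """Build the per-application REST URLs that a metrics collector would poll."""
    base = history_server.rstrip("/")
    app = f"{base}{API_ROOT}/applications/{quote(app_id, safe='')}"
    return {
        "application": app,
        "jobs": f"{app}/jobs",
        "stages": f"{app}/stages",
        "executors": f"{app}/executors",
        "sql": f"{app}/sql",  # SQL endpoint, available from Spark 3.1.0
    }


# Placeholder host and application id.
urls = rest_urls("http://history.example:18080/", "app-20200929-0001")
for name, url in urls.items():
    print(name, url)
```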
New Spark SQL Rest API [coming with v3.1.0]
New SQL Rest Endpoints
Comparison of new Spark SQL Rest API Json Outputs
Improved Version
Older Version (Cherry-picked from OSS)
Improvements
1. Correlation between physical operators and their runtime metrics
2. wholeStageCodegenId support across multiple physical operators
3. Normalization on metric values to be able to run aggregations
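To illustrate the third improvement: metric values that arrive as display strings cannot be summed or averaged until they are turned into numbers. The sketch assumes values look like `"1,234"` or `"2.5 s"`; that format, the unit table, and `normalize_metric` are assumptions for this example, not the actual output format of the API.

```python
import re

# Assumed duration suffixes, converted to milliseconds.
_UNIT_TO_MS = {"ms": 1.0, "s": 1_000.0, "m": 60_000.0, "h": 3_600_000.0}


def normalize_metric(value: str) -> float:
    """Convert a display string such as '1,234' or '2.5 s' into a plain number.

    Counts are returned as-is; durations are returned in milliseconds.
    """
    text = value.strip().replace(",", "")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(ms|s|m|h)?", text)
    if match is None:
        raise ValueError(f"unrecognised metric value: {value!r}")
    number, unit = match.groups()
    scale = _UNIT_TO_MS[unit] if unit else 1.0
    return float(number) * scale


print(sum(normalize_metric(v) for v in ["1,234", "766"]))
print(normalize_metric("2.5 s"))
```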
Sample Queries on Spark SQL Metrics
What is the total number of
loaded input/output
rows by file type, tenant,
application, date?
What are the top 25
tenants running Join, Filter,
Sort (etc..) operations?
What are the most used
operations by tenants,
applications, dates?
File Scan Operation
What is number of files
by file type, tenant,
application, date?
What is total scan time
and total metadata time
by min, med, max, file
type, tenant, application,
date?
What is total number of
operations by tenants,
applications, dates?
What are Top 25 Tenants
Having Max Broadcasted
Data Size (GB)?
What is the total number of
joins, BroadcastHashJoin
or SortMergeJoin across all
tenants by day?
Join
What are Top 25 Tenants
Having Max Time to
Collect during Broadcast
(Minute)?
What are Top 25
Tenants Having Max
Time To Broadcast
(Minute) or To Build
during Broadcast
(Minute)?
...
...
...
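A "top 25 tenants" question like the ones above reduces to a group-by, a sum, and a sort. The sketch runs over in-memory rows instead of the Hive tables; the row layout, the metric names, and the values are made up for illustration.

```python
from collections import defaultdict


def top_tenants(rows, metric, limit=25):
    """Sum one metric per tenant and return the largest totals first."""
    totals = defaultdict(float)
    for row in rows:
        if row["metric"] == metric:
            totals[row["tenant"]] += row["value"]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical rows, one per (tenant, metric) observation.
rows = [
    {"tenant": "tenant_a", "metric": "broadcast_data_size_gb", "value": 1.5},
    {"tenant": "tenant_b", "metric": "broadcast_data_size_gb", "value": 4.0},
    {"tenant": "tenant_a", "metric": "broadcast_data_size_gb", "value": 0.5},
    {"tenant": "tenant_b", "metric": "time_to_collect_min", "value": 3.0},
]
print(top_tenants(rows, "broadcast_data_size_gb"))
```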
Correlation between Physical Operators & SQL Metrics
•We also integrated our physical plans with runtime SQL metrics
•Physical operators can be correlated with their runtime metrics from application logs for troubleshooting and debugging
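The correlation can be pictured as attaching each operator's metrics to the operator itself and grouping operators by the whole-stage-codegen stage they were fused into. The node dictionaries below are loosely modelled on the node list of the SQL REST API output; treat the field names and values as illustrative assumptions, and `operators_by_codegen_stage` as a hypothetical helper.

```python
def operators_by_codegen_stage(nodes):
    """Group physical operators, with their metrics, by wholeStageCodegenId.

    Operators that are not part of a codegen stage are collected under None.
    """
    stages = {}
    for node in nodes:
        metrics = {m["name"]: m["value"] for m in node.get("metrics", [])}
        stage_id = node.get("wholeStageCodegenId")
        stages.setdefault(stage_id, []).append((node["nodeName"], metrics))
    return stages


# Illustrative node list, not real API output.
nodes = [
    {"nodeId": 0, "nodeName": "Filter", "wholeStageCodegenId": 1,
     "metrics": [{"name": "number of output rows", "value": "10"}]},
    {"nodeId": 1, "nodeName": "Project", "wholeStageCodegenId": 1, "metrics": []},
    {"nodeId": 2, "nodeName": "Exchange",
     "metrics": [{"name": "shuffle records written", "value": "10"}]},
]
grouped = operators_by_codegen_stage(nodes)
for stage_id, operators in grouped.items():
    print(stage_id, operators)
```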
The developed patches were also backported to the OSS repo for community use:
•[SPARK-31440][SQL] Improve SQL Rest API
https://github.com/apache/spark/pull/28208
•[SPARK-32548][SQL] - Add Application attemptId support to SQL Rest API
https://github.com/apache/spark/pull/29364
•[SPARK-31566][SQL][DOCS] Add SQL Rest API Documentation
https://github.com/apache/spark/pull/28354
Backported Patches to Spark OSS Repo [v3.1.0]
Spark 3.0 introduced the following features:
‒ Adaptive Query Execution (SPARK-31412)
‒ Dynamic Partition Pruning (SPARK-11150)
‒ Scala 2.12 Support (SPARK-26132)
‒ JDK 11 Support (SPARK-24417)
‒ Hadoop 3 Support (SPARK-23534)
• Spark 3.x Upgrade (+ Scala, JDK, Hadoop)
• Performance, Troubleshooting and Debugging Improvements
• Multi-Tenancy Support
What is next?
One more thing...
HDFS / S3
Prism 01
Tenant 01
Prism 02
Tenant 02
Prism 03
Tenant 03
Prism 04
Tenant 04
Spark Cluster Spark Cluster Spark Cluster Spark Cluster
Prism Deployment - Today
Prism Deployment - “Multiverse”
Spark Cluster Spark Cluster
HDFS / S3
Tenant 02 Tenant 04 Tenant 03 Tenant 06 Tenant 05 Tenant 07 Tenant 01 Tenant 08
Prism 01 Prism 02 Prism 03
Spark Cluster
Thank You!
Q & A

VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 

Dernier (20)

Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 

Apache Spark Development Lifecycle @ Workday - ApacheCon 2020

  • 1. Apache Spark Development Lifecycle @ Workday Pavel Hardak – Eren Avsarogullari
  • 2. Agenda: What is Workday? | “Power of One” and Prism Analytics | How does Apache Spark fit in? | Custom Spark Upgrade Model | Runtime Metrics Pipeline | What is next?
  • 3. About Workday: Enterprise Business Applications for a Changing World. FY20 revenue $3.6B; ~28% Y/Y growth; >7,700 customers; >45% of Fortune 500; >12,300 employees; NASDAQ: WDAY. Human Capital, Financials, Planning, Analytics; cloud native, multi-tenant; 30% of revenue re-invested in product each year; >40 Advisory Partners; >200 Software Partners. Diagram labels: Planning, Financial Management, Human Capital Management, Analytics & Benchmarking
  • 5. One Platform (diagram): Business Process Framework, Object Data Model, Reporting and Analytics, Security, Integration, Cloud, Machine Learning. One Source for Data | One Security Model | One Experience | One Community
  • 6. One Platform, highlighting Object Data Model: Durable, Metadata, Extensible
  • 7. One Platform, highlighting Security: Encryption, Privacy and Compliance, Trust
  • 8. One Platform, highlighting Reporting and Analytics: Exploratory, Descriptive, Augmented
  • 9. The Leading Enterprise Cloud for Finance and HR: 37 million+ workers; 100 billion+ transactions per year; 96.1% of transactions < 1 second; 99.9% actual availability; 200+ companies; #1 Future 50, Fortune; #2 40 Best Workplaces in Technology, Fortune; 10 thousand+ certified resources in the ecosystem
  • 11. Workday Maintains Your Data Gravity (diagram): Financial, Employees, GL, HR & Payroll, Third-Party HR & FIN, Industry & Homegrown, CRM, Marketing, Service, Subsidiaries, Contract Labor
  • 12. Workday Prism Analytics: the full spectrum of workforce, financial, and operational insights, all within Workday. Workday Data + Non-Workday Data
  • 13. Prism Analytics Momentum: 100% YoY growth; over 500 Prism Analytics customers (slide footer: Workday Confidential)
  • 14. Apache Spark as foundational technology (diagram): Table Ingestion, Data Prep, Examples Engine, Lens Build Engine, Query Engine and Mercury; Workday Spark Runtime Engine; Compute (YARN) and Storage (HDFS/S3); Prism UI and APIs; Accounting Center, People Analytics, DBFR, Analytics Platform, Cosmos, DD4A
  • 15. Prism Tenants - Deployment (simplified diagram): Prism 01 / Tenant 01 through Prism 04 / Tenant 04, one Spark Cluster per tenant, on HDFS / S3
  • 16. Workday in the Cloud (data center map): ASH, PDX, ATL, DUB, AMS, ORE, MTL, COL, labeled with the roles PROD & NPRD, ENG, SALES, NPRD, PROD, DR for PDX, DR for ASH, DR for DUB
  • 17. Prism-enabled Tenant - Today (diagram): Prism; HDFS / S3; YARN; ADS; Spark Drivers and Spark Executors for Data Prep Interactive, Lens Build Phase 1, Lens Build Phase 2, and Query Engine
  • 18. Workday Spark = Apache Spark ++: Apache Spark plus Autonomous Operational Stability, Core Stability, Complex Application logic as Spark Plans, Performance & Scalability for batch processing, Serviceability, Multi-tenancy, Ingest Latency, Interactive Query Performance
  • 19. With this scale, complexity, dependencies… How can you do Spark version upgrades?
  • 20. Custom Spark Upgrade Model. Spark upgrade challenges: a high number of tenants; long-running Spark applications; progressive roll-out; rollback; maintaining a custom Spark fork. Previous approach: a single Spark version supported against a single repo (Custom Repo, Spark Version). New approach: multiple Spark versions supported against a single repo (Custom Repo, Shim API, Spark Current Version, Spark Next Version). This upgrade model is not specific to Spark, so it can be applied to any internal or external API upgrade that faces these kinds of challenges. It is used for both major and minor Spark version upgrades.
  • 21. Custom Spark Release Preparation. Remove PII data from logs: obfuscation of Spark query plans and DataFrame schemas. Catalyst Optimizer: additional optimization rules for aggregations and large CASE statements. Extension for physical plan: enables correlation between physical operators and their runtime metrics. REST APIs: SQL REST API improvements to query and aggregate metrics at the physical-operator level. Benchmark module: an additional module that runs benchmark tests on newly introduced Spark patches using standard TPC-H and custom queries.
  • 22. Shim API: SparkShim interface; SparkShimImpl for Spark v2.3.0; SparkShimImpl for Spark v2.4.4. Spark API differences between the two versions may introduce compile-time issues (e.g. invalid type) and/or runtime issues (e.g. NoSuchMethodError).
  • 23. Compile-time & Runtime Version Selections. Selected Spark versions by classpath type: Compile-time (compileClasspath + testCompileClasspath): the Spark compile-time version is the current version. Runtime (runtimeClasspath + testRuntimeClasspath): the Spark runtime version is selected by feature toggle as the current or next version. A sample Gradle build script snippet on the slide selects the Spark and Shim versions for the compile-time and runtime classpaths. The feature toggle selects the Spark version at: build time (runtime version selection for the classpath); test pipelines (to run UT, IT and perf tests by Spark version); environment (to enable a Spark version at the env level: test, preprod or prod). Shim API artifacts are shipped in addition to Spark artifacts (by version).
  • 24. Verification & Progressive Roll-out & Cleanup. Verification Phase: verify the following test pipelines against both Spark versions. Automated regression testing: unit & integration test pipelines. Performance testing: Spark Benchmark Pipeline: Spark current vs new version perf tests (executing standard TPC-H and custom queries) + Hadoop; End2End Perf Pipeline: custom applications + Spark + Hadoop. Progressive Roll-out Phase: Wave I: single tenant (Internal), 2 weeks. Wave II: multiple tenants (Impl / NonProd), 2 weeks. Wave III: all tenants (Internal / Impl / Prod), 4 weeks. Cleanup Phase: previous Spark version (fork; artifacts from Artifactory / Maven repository); Shim API.
  • 25. Spark SQL Engine - Query Planning & Execution (diagram): SQL / Dataset / DataFrame -> Unresolved Logical Plan -> [Analysis] -> Logical Plan -> [Optimizations] -> Optimized Logical Plan -> [Physical Plans Generation] -> Physical Plan -> CostModel -> Selected Physical Plan -> DAG Execution. Phases: Logical Planning, Physical Planning, Execution. SQL Metrics; Application, Job, Stage, Task; Spark UI, Rest APIs, Event Logs
  • 26. Runtime Metrics Pipeline Architecture (diagram): Proton (Application Server); Data Acquisition, Data Preparation, Query Engine; Spark Applications; Hadoop Cluster; HDFS / S3; Spark History Server; Stats App; Data Warehouse. Spark REST APIs: Application, Job, Stage, Task, Executors, SQL (New). Spark Hive Tables (1x1 with the REST APIs): app_metrics, job_metrics, stage_metrics, task_metrics, executor_metrics, sql_metrics
  • 27. New Spark SQL REST API [coming with v3.1.0]: New SQL REST Endpoints
  • 28. Comparison of new Spark SQL REST API JSON outputs: Older Version (cherry-picked from OSS) vs Improved Version. Improvements: 1. Correlation between physical operators and their runtime metrics 2. wholeStageCodegenId support across multiple physical operators 3. Normalization of metric values so that aggregations can be run on them
  • 29. Sample Queries on Spark SQL Metrics. What is the total number of input/output rows loaded, by file type, tenant, application, date? What are the top 25 tenants running Join, Filter, Sort (etc.) operations? What are the most used operations by tenant, application, date? File Scan Operation: What is the number of files by file type, tenant, application, date? What are the total scan time and total metadata time (min, med, max) by file type, tenant, application, date? What is the total number of operations by tenant, application, date? What are the top 25 tenants with the max broadcast data size (GB)? What is the total number of joins, BroadcastHashJoin or SortMergeJoin, across all tenants by day? Join: What are the top 25 tenants with the max time to collect during broadcast (minutes)? What are the top 25 tenants with the max time to broadcast (minutes) or to build during broadcast (minutes)? ...
  • 30. Correlation between Physical Operators & SQL Metrics. We also integrated our physical plans with runtime SQL metrics. Application logs can now correlate physical operators with their runtime metrics for troubleshooting and debugging.
  • 31. Backported Patches to Spark OSS Repo [v3.1.0]. The developed patches were also backported to the OSS repo for community use: [SPARK-31440][SQL] Improve SQL Rest API, https://github.com/apache/spark/pull/28208 | [SPARK-32548][SQL] Add Application attemptId support to SQL Rest API, https://github.com/apache/spark/pull/29364 | [SPARK-31566][SQL][DOCS] Add SQL Rest API Documentation, https://github.com/apache/spark/pull/28354
  • 32. What is next? Spark 3.x upgrade (+ Scala, JDK, Hadoop); performance, troubleshooting and debugging improvements; multi-tenancy support. Spark 3.0 introduced the following features: Adaptive Query Execution (SPARK-31412); Dynamic Partition Pruning (SPARK-11150); Scala 2.12 support (SPARK-26132); JDK 11 support (SPARK-24417); Hadoop 3 support (SPARK-23534).
  • 34. Prism Deployment - Today (diagram): Prism 01 / Tenant 01 through Prism 04 / Tenant 04, one Spark Cluster per tenant, on HDFS / S3
  • 35. Prism Deployment - “Multiverse” (diagram): Tenant 01 through Tenant 08, Prism 01 through Prism 03, three Spark Clusters, on HDFS / S3
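
The Shim API on slide 22 can be pictured as one interface with one implementation per supported Spark version, so application code never touches a version-specific Spark API directly. The following is a minimal Python sketch of that pattern, not Workday's code. Only the names SparkShim and SparkShimImpl and the two version numbers come from the slide; the method `plan_description`, the class suffixes, the registry and `load_shim` are hypothetical, and the method bodies are placeholders rather than real Spark calls.

```python
from abc import ABC, abstractmethod


class SparkShim(ABC):
    """Version-neutral interface that application code depends on."""

    @abstractmethod
    def spark_version(self) -> str:
        """Return the Spark version this shim targets."""

    @abstractmethod
    def plan_description(self, plan: str) -> str:
        """Hypothetical operation whose underlying Spark API differs by version."""


class SparkShimImpl230(SparkShim):
    """Shim for Spark v2.3.0 (placeholder bodies, no real Spark calls)."""

    def spark_version(self) -> str:
        return "2.3.0"

    def plan_description(self, plan: str) -> str:
        return f"[2.3.0] {plan}"


class SparkShimImpl244(SparkShim):
    """Shim for Spark v2.4.4 (placeholder bodies, no real Spark calls)."""

    def spark_version(self) -> str:
        return "2.4.4"

    def plan_description(self, plan: str) -> str:
        return f"[2.4.4] {plan}"


# Hypothetical registry: one shim implementation per supported version.
_SHIMS = {"2.3.0": SparkShimImpl230, "2.4.4": SparkShimImpl244}


def load_shim(version: str) -> SparkShim:
    """Instantiate the shim for the given Spark version, or fail loudly."""
    try:
        return _SHIMS[version]()
    except KeyError:
        raise ValueError(f"no shim registered for Spark {version}") from None
```

Callers would hold only a `SparkShim` reference, which is what would let the cleanup phase delete the older implementation without touching application code.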
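
Slide 23 states that compile-time classpaths stay on the current Spark version while runtime classpaths follow a feature toggle. The Gradle snippet from the slide is not in the transcript, so the following is only a Python sketch of that selection rule. The four classpath names come from the slide; treating 2.3.0 as "current" and 2.4.4 as "next" is an assumption based on slide 22, and `resolve_spark_versions` and `use_next_spark` are hypothetical names.

```python
CURRENT_SPARK = "2.3.0"  # assumed "current" version (taken from slide 22)
NEXT_SPARK = "2.4.4"     # assumed "next" version (taken from slide 22)

COMPILE_CLASSPATHS = ("compileClasspath", "testCompileClasspath")
RUNTIME_CLASSPATHS = ("runtimeClasspath", "testRuntimeClasspath")


def resolve_spark_versions(use_next_spark: bool) -> dict:
    """Map each classpath to the Spark version it should resolve.

    Compile-time classpaths always use the current version; runtime
    classpaths use the next version only when the toggle is on.
    """
    runtime_version = NEXT_SPARK if use_next_spark else CURRENT_SPARK
    versions = {name: CURRENT_SPARK for name in COMPILE_CLASSPATHS}
    versions.update({name: runtime_version for name in RUNTIME_CLASSPATHS})
    return versions
```

With the toggle off, all four classpaths resolve to the current version, which is the rollback position the slide's challenges list calls for.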
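
The SQL REST endpoints mentioned on slides 27 and 31 are documented for Spark 3.1 under `/api/v1/applications/[app-id]/sql`, with the query parameters `details`, `planDescription`, `offset` and `length`. As a rough illustration, the sketch below only builds such URLs and makes no network call. The function name `sql_rest_url` is hypothetical, and the host, port and application id shown in the usage note are placeholders.

```python
from typing import Optional
from urllib.parse import urlencode


def sql_rest_url(
    base_url: str,
    app_id: str,
    execution_id: Optional[int] = None,
    details: bool = True,
    plan_description: bool = True,
    offset: Optional[int] = None,
    length: Optional[int] = None,
) -> str:
    """Build a URL for the Spark SQL REST endpoint of one application."""
    path = f"{base_url.rstrip('/')}/api/v1/applications/{app_id}/sql"
    if execution_id is not None:
        # A single SQL execution is addressed by its execution id.
        path += f"/{execution_id}"
    params = {
        "details": str(details).lower(),
        "planDescription": str(plan_description).lower(),
    }
    # Paging applies only when listing all queries of an application.
    if execution_id is None and offset is not None and length is not None:
        params["offset"] = offset
        params["length"] = length
    return f"{path}?{urlencode(params)}"
```

For example, `sql_rest_url("http://localhost:18080", "app-1", plan_description=False)` targets a placeholder history server and skips the physical plan description, which can help when plans are large.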
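
Slide 29 lists questions such as "What are the top 25 tenants running Join, Filter, Sort operations?". The talk answers these with queries over Hive tables; the sketch below shows the same aggregation in plain Python over in-memory rows, purely for illustration. The row fields `tenant` and `node_name` are an assumed schema, not the actual layout of the `sql_metrics` table, and `top_tenants_by_operator` is a hypothetical helper.

```python
from collections import Counter


def top_tenants_by_operator(rows, operator, n=25):
    """Return the n tenants with the most occurrences of a physical operator.

    Each row is a dict with the assumed keys "tenant" and "node_name".
    """
    counts = Counter(
        row["tenant"] for row in rows if row["node_name"] == operator
    )
    return counts.most_common(n)


# Made-up sample rows, for illustration only.
sample_rows = [
    {"tenant": "tenant-01", "node_name": "SortMergeJoin"},
    {"tenant": "tenant-02", "node_name": "SortMergeJoin"},
    {"tenant": "tenant-02", "node_name": "SortMergeJoin"},
    {"tenant": "tenant-02", "node_name": "Filter"},
]
```

Calling `top_tenants_by_operator(sample_rows, "SortMergeJoin", n=2)` ranks `tenant-02` ahead of `tenant-01`; normalized numeric metric values (improvement 3 on slide 28) are what make sums and rankings like this possible.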