SlideShare une entreprise Scribd logo
1  sur  28
Télécharger pour lire hors ligne
Anant Nag, LinkedIn
Shankar M, LinkedIn
Beyond unit tests: Testing for
Spark/Hadoop workflows
#EUde12
#EUde12
A day in the life of data engineer
● Produce D.E.D Report by 5 AM.
● At 1 AM, an alert goes off saying the pipeline has
failed
● Dev wakes up, curses his bad luck and starts backfill
job at high priority
● Finds cluster busy, starts killing jobs to make way for
D.E.D job
● Debugs failures, finds today is daylight savings day
and data partitions has gone haywire.
● Most days it works on retry
● Some days, we are not so lucky
NO D.E.D => We are D.E.A.D
#EUde12
Scale @ LinkedIn
• 10s of clusters running different versions of software
• 1000s of machines in each cluster
• 1000s of users
• 100s of 1000s of Azkaban workflows running per month
• Powers key business impacting features
– People you may know
– Who viewed my profile
#EUde12
● Cluster gets upgraded
● Data partition changes
● Code needs to be rewritten in a
new technology
● Different version of a dependent
jar is available
● ...
Nightmares at data street
#EUde12
Do you know your dependencies?
● Direct dependencies
● Indirect dependencies
● Hidden dependencies
● Semantic dependencies
“Hey, I am changing column X in data P to format B. Do you foresee any issues?”
Do you know your dependencies?
#EUde12
Paranoia is justified
No confidence to make changes
Lack of agility
Loss of innovation
Paranoia is justified
#EUde12
Architecture
● Workflow definition
● Test definitions
● Test execution environment:
○ Local
○ Production
● Test data
Architecture
#EUde12
hadoop {
workflow('workflow1') {
sparkJob('job1') {
uses 'com.linkedin.example.SparkJob’
executes ‘exampleSpark.jar’
jars ‘jar1.jar,jar2.jar’
executorMemory ‘2G’
numExecutors 400
}
sparkJob('job2') {
uses 'com.linkedin.example.SparkJob2’
executes ‘exampleSpark.jar’
depends 'job1'
}
targets 'job2'
}
}
Workflow definition
#EUde12
Test definition
hadoop {
workflow('countByCountryFlow') {
sparkJob('countByCountry') {
uses 'com.linkedin.example.SparkJob’
executes ‘exampleSpark.jar’
reads files: [
'input_data': "/data/input"
]
writes files: [
'output_path': "/jobs/output"
]
}
targets 'countByCountry'
}
}
hadoop {
workflowTestSuite("test1") {
addWorkflow('countByCountryFlow') {
}
}
workflowTestSuite("test2") {
...
}
}
Test definition
#EUde12
hadoop {
workflow('countByCountryFlow') {
sparkJob('countByCountry') {
uses 'com.linkedin.example.SparkJob’
executes ‘exampleSpark.jar’
reads files: [
'input_data': "/data/input"
]
writes files: [
'output_path': "/jobs/output"
]
}
targets 'countByCountry'
}
}
hadoop {
workflowTestSuite("test1") {
addWorkflow('countByCountryFlow') {
lookup('countByCountry') {
reads files: [
'input_data': '/path/to/test/data'
]
writes files: [
'output_path': '/path/to/test/output'
]
}
}
}
}
Overriding parameters
#EUde12
● 10s of clusters
● Multiple versions of Spark
● Some clusters update now, some later
● Code should run on all the versions
Configuration override
#EUde12
Configuration override
● Write multiple tests
● One test for each version of Spark
● Override Spark version in the tests
workflowTestSuite("testWithSpark1_6_3") {
addWorkflow('countByCountryFlow') {
lookup('countByCountry') {
set properties: [
‘spark.version’: ‘1.6.3’
]
}
}
}
workflowTestSuite("testWithSpark2_1_0") {
addWorkflow('countByCountryFlow') {
lookup('countByCountry') {
set properties: [
‘spark.version’: ‘2.1.0’
]
}
}
}
Configuration override
#EUde12
assert(x == y)
#EUde12
● Complex pipeline with 10s of Jobs
● Most of them are Spark Jobs
True story
#EUde12
● DataFrames API
● Rewrite all spark jobs to use DataFrames
● Is my new code ready for production??
● Write tests
○ Assertions on output
● All tests succeed after changes (^_^)
#EUde12
Data Validation and Assertion
● Types of checks
○ Record level checks
○ Aggregate level
■ Data transformation
■ Data aggregation
■ Data distribution
● Assert against Expectation
Data validation and assertion
#EUde12
Record level Validation
us 148083
in 46074
cn 34332
br 30836
gb 24387
fr 14983
...
hadoop {
workflowTestSuite(‘test1’) {
addWorkflow(‘countFlow’,’testCountFlow’){}
assertionWorkflow('assertNonNegativeCount')
{
sparkJob("assertNonNegativeCount") {
}
targets 'assertNonNegativeCount'
}
}
val count = spark.read.avro(input)
require(count.
map(r => r.getAs[Long](“count”)).
where(_ < 0 )).count() == 0)
Record level validation
#EUde12
Aggregated validation
● Binary spam classifier C1
● C1 classifies 30% of test input as spam
● Wrote a new classifier C2
● C2 can deviate at most 10%
● Write aggregated validations
Aggregated validation
#EUde12
Execution
#EUde12
Test execution
$> gradle azkabanTest -Ptestname=test1
● Test results on the terminal
● Reports for the passed and failed tests
Tests [1/1]:=> Flow TEST-countByCountry completed with status SUCCEEDED
Summary of the completed tests
Job Statistics for individual test
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Flow Name | Latest Exec ID | Status | Running | Succeeded | Failed | Ready | Cancelled | Disabled | Total
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
TEST-countByCountry | 2997645 | SUCCEEDED | 0 | 2 | 0 | 0 | 0 | 0 | 2
Test execution
#EUde12
Test automation
● Auto deployment process for Azkaban artifacts
● Tests run as part of the deployment
● Tests fail => Artifact deployment fails
● No un-tested code can go to production
Test automation
#EUde12
Test Data
#EUde12
Test data
● Real data
○ Very large data
○ Tests run very slow
● Randomly generated data
○ Not equivalent to real data
○ Real issues can never be catched
● Manually created data
○ Covers all the cases
○ Too much effort
● Samples
Test data
#EUde12
Requirements of test data
● Representative of real data
● Smaller in size
● Automated generation
● Sharable
● Discoverable
Requirements of sample data
The road ahead...
Road ahead
#EUde12
Flexible execution environment
● Sandboxed
● Replicate production settings
● Save and Restore
● Ability to run in a single box
Flexible execution environment
#EUde12
Data validation framework
● Validation logic in schema
● Automated validation on read and write
● Serves as contract between producers and consumers
Data validation Framework
#EUde12
Azkaban DSL: https://github.com/linkedin/linkedin-gradle-plugin-for-apache-hadoop
Azkaban: https://github.com/azkaban/azkaban

Contenu connexe

Tendances

Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013larsgeorge
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Edureka!
 
PGConf.ASIA 2017 Logical Replication Internals (English)
PGConf.ASIA 2017 Logical Replication Internals (English)PGConf.ASIA 2017 Logical Replication Internals (English)
PGConf.ASIA 2017 Logical Replication Internals (English)Noriyoshi Shinoda
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Databricks
 
DNS Security Presentation ISSA
DNS Security Presentation ISSADNS Security Presentation ISSA
DNS Security Presentation ISSASrikrupa Srivatsan
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)DataStax Academy
 
Improving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInImproving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInDatabricks
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at FacebookDatabricks
 
Advanced SQL injection to operating system full control (slides)
Advanced SQL injection to operating system full control (slides)Advanced SQL injection to operating system full control (slides)
Advanced SQL injection to operating system full control (slides)Bernardo Damele A. G.
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache CalciteJulian Hyde
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLJim Mlodgenski
 
Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)Daniel Toomey
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Enterprise and multi-tier Power BI deployments with Azure DevOps.
Enterprise and multi-tier Power BI deployments with Azure DevOps.Enterprise and multi-tier Power BI deployments with Azure DevOps.
Enterprise and multi-tier Power BI deployments with Azure DevOps.Marc Lelijveld
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and visionStephan Ewen
 
MarkLogic Overview and Use Cases
MarkLogic Overview and Use CasesMarkLogic Overview and Use Cases
MarkLogic Overview and Use CasesIoan Toma
 

Tendances (20)

Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
PGConf.ASIA 2017 Logical Replication Internals (English)
PGConf.ASIA 2017 Logical Replication Internals (English)PGConf.ASIA 2017 Logical Replication Internals (English)
PGConf.ASIA 2017 Logical Replication Internals (English)
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
 
DNS Security Presentation ISSA
DNS Security Presentation ISSADNS Security Presentation ISSA
DNS Security Presentation ISSA
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)
 
Improving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInImproving Spark SQL at LinkedIn
Improving Spark SQL at LinkedIn
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
 
Advanced SQL injection to operating system full control (slides)
Advanced SQL injection to operating system full control (slides)Advanced SQL injection to operating system full control (slides)
Advanced SQL injection to operating system full control (slides)
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
 
Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Enterprise and multi-tier Power BI deployments with Azure DevOps.
Enterprise and multi-tier Power BI deployments with Azure DevOps.Enterprise and multi-tier Power BI deployments with Azure DevOps.
Enterprise and multi-tier Power BI deployments with Azure DevOps.
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and vision
 
MarkLogic Overview and Use Cases
MarkLogic Overview and Use CasesMarkLogic Overview and Use Cases
MarkLogic Overview and Use Cases
 

Similaire à Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Anant Nag

Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsDataWorks Summit
 
Gatling Performance Workshop
Gatling Performance WorkshopGatling Performance Workshop
Gatling Performance WorkshopSai Krishna
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Executiondatamantra
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performanceEngine Yard
 
Android training day 4
Android training day 4Android training day 4
Android training day 4Vivek Bhusal
 
Testing Persistent Storage Performance in Kubernetes with Sherlock
Testing Persistent Storage Performance in Kubernetes with SherlockTesting Persistent Storage Performance in Kubernetes with Sherlock
Testing Persistent Storage Performance in Kubernetes with SherlockScyllaDB
 
Ebs dba con4696_pdf_4696_0001
Ebs dba con4696_pdf_4696_0001Ebs dba con4696_pdf_4696_0001
Ebs dba con4696_pdf_4696_0001jucaab
 
Java 8 -12: da Oracle a Eclipse. Due anni e una rivoluzione
Java 8 -12: da Oracle a Eclipse. Due anni e una rivoluzioneJava 8 -12: da Oracle a Eclipse. Due anni e una rivoluzione
Java 8 -12: da Oracle a Eclipse. Due anni e una rivoluzioneThinkOpen
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 
Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overviewprevota
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorDatabricks
 
Pandora FMS: PostgreSQL Plugin
Pandora FMS: PostgreSQL PluginPandora FMS: PostgreSQL Plugin
Pandora FMS: PostgreSQL PluginPandora FMS
 
Using Grails to Power your Electric Car
Using Grails to Power your Electric CarUsing Grails to Power your Electric Car
Using Grails to Power your Electric CarGR8Conf
 
How to test code with mruby
How to test code with mrubyHow to test code with mruby
How to test code with mrubyHiroshi SHIBATA
 
Give Me My Damn Report: Making NoSQL Data Accessible to the Business
Give Me My Damn Report: Making NoSQL Data Accessible to the BusinessGive Me My Damn Report: Making NoSQL Data Accessible to the Business
Give Me My Damn Report: Making NoSQL Data Accessible to the BusinessFormant
 
Nagios Conference 2013 - Troy Lea - Leveraging and Understanding Performance ...
Nagios Conference 2013 - Troy Lea - Leveraging and Understanding Performance ...Nagios Conference 2013 - Troy Lea - Leveraging and Understanding Performance ...
Nagios Conference 2013 - Troy Lea - Leveraging and Understanding Performance ...Nagios
 

Similaire à Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Anant Nag (20)

Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
 
Gatling Performance Workshop
Gatling Performance WorkshopGatling Performance Workshop
Gatling Performance Workshop
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Sge
SgeSge
Sge
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performance
 
Android training day 4
Android training day 4Android training day 4
Android training day 4
 
Testing Persistent Storage Performance in Kubernetes with Sherlock
Testing Persistent Storage Performance in Kubernetes with SherlockTesting Persistent Storage Performance in Kubernetes with Sherlock
Testing Persistent Storage Performance in Kubernetes with Sherlock
 
Ebs dba con4696_pdf_4696_0001
Ebs dba con4696_pdf_4696_0001Ebs dba con4696_pdf_4696_0001
Ebs dba con4696_pdf_4696_0001
 
Java 8 -12: da Oracle a Eclipse. Due anni e una rivoluzione
Java 8 -12: da Oracle a Eclipse. Due anni e una rivoluzioneJava 8 -12: da Oracle a Eclipse. Due anni e una rivoluzione
Java 8 -12: da Oracle a Eclipse. Due anni e una rivoluzione
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overview
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Gradle como alternativa a maven
Gradle como alternativa a mavenGradle como alternativa a maven
Gradle como alternativa a maven
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS Accelerator
 
Pandora FMS: PostgreSQL Plugin
Pandora FMS: PostgreSQL PluginPandora FMS: PostgreSQL Plugin
Pandora FMS: PostgreSQL Plugin
 
Using Grails to Power your Electric Car
Using Grails to Power your Electric CarUsing Grails to Power your Electric Car
Using Grails to Power your Electric Car
 
How to test code with mruby
How to test code with mrubyHow to test code with mruby
How to test code with mruby
 
Give Me My Damn Report: Making NoSQL Data Accessible to the Business
Give Me My Damn Report: Making NoSQL Data Accessible to the BusinessGive Me My Damn Report: Making NoSQL Data Accessible to the Business
Give Me My Damn Report: Making NoSQL Data Accessible to the Business
 
Nagios Conference 2013 - Troy Lea - Leveraging and Understanding Performance ...
Nagios Conference 2013 - Troy Lea - Leveraging and Understanding Performance ...Nagios Conference 2013 - Troy Lea - Leveraging and Understanding Performance ...
Nagios Conference 2013 - Troy Lea - Leveraging and Understanding Performance ...
 

Plus de Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Plus de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Dernier

Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 

Dernier (20)

Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 

Beyond unit tests: Testing for Spark/Hadoop Workflows with Shankar Manian Anant Nag

  • 1. Anant Nag, LinkedIn Shankar M, LinkedIn Beyond unit tests: Testing for Spark/Hadoop workflows #EUde12
  • 2. #EUde12 A day in the life of data engineer ● Produce D.E.D Report by 5 AM. ● At 1 AM, an alert goes off saying the pipeline has failed ● Dev wakes up, curses his bad luck and starts backfill job at high priority ● Finds cluster busy, starts killing jobs to make way for D.E.D job ● Debugs failures, finds today is daylight savings day and data partitions has gone haywire. ● Most days it works on retry ● Some days, we are not so lucky NO D.E.D => We are D.E.A.D
  • 3. #EUde12 Scale @ LinkedIn • 10s of clusters running different versions of software • 1000s of machines in each cluster • 1000s of users • 100s of 1000s of Azkaban workflows running per month • Powers key business impacting features – People you may know – Who viewed my profile
  • 4. #EUde12 ● Cluster gets upgraded ● Data partition changes ● Code needs to be rewritten in a new technology ● Different version of a dependent jar is available ● ... Nightmares at data street
  • 5. #EUde12 Do you know your dependencies? ● Direct dependencies ● Indirect dependencies ● Hidden dependencies ● Semantic dependencies “Hey, I am changing column X in data P to format B. Do you foresee any issues?” Do you know your dependencies?
  • 6. #EUde12 Paranoia is justified No confidence to make changes Lack of agility Loss of innovation Paranoia is justified
  • 7. #EUde12 Architecture ● Workflow definition ● Test definitions ● Test execution environment: ○ Local ○ Production ● Test data Architecture
  • 8. #EUde12 hadoop { workflow('workflow1') { sparkJob('job1') { uses 'com.linkedin.example.SparkJob’ executes ‘exampleSpark.jar’ jars ‘jar1.jar,jar2.jar’ executorMemory ‘2G’ numExecutors 400 } sparkJob('job2') { uses 'com.linkedin.example.SparkJob2’ executes ‘exampleSpark.jar’ depends 'job1' } targets 'job2' } } Workflow definition
  • 9. #EUde12 Test definition hadoop { workflow('countByCountryFlow') { sparkJob('countByCountry') { uses 'com.linkedin.example.SparkJob’ executes ‘exampleSpark.jar’ reads files: [ 'input_data': "/data/input" ] writes files: [ 'output_path': "/jobs/output" ] } targets 'countByCountry' } } hadoop { workflowTestSuite("test1") { addWorkflow('countByCountryFlow') { } } workflowTestSuite("test2") { ... } } Test definition
  • 10. #EUde12 hadoop { workflow('countByCountryFlow') { sparkJob('countByCountry') { uses 'com.linkedin.example.SparkJob’ executes ‘exampleSpark.jar’ reads files: [ 'input_data': "/data/input" ] writes files: [ 'output_path': "/jobs/output" ] } targets 'countByCountry' } } hadoop { workflowTestSuite("test1") { addWorkflow('countByCountryFlow') { lookup('countByCountry') { reads files: [ 'input_data': '/path/to/test/data' ] writes files: [ 'output_path': '/path/to/test/output' ] } } } } Overriding parameters
  • 11. #EUde12 ● 10s of clusters ● Multiple versions of Spark ● Some clusters update now, some later ● Code should run on all the versions Configuration override
  • 12. #EUde12 Configuration override ● Write multiple tests ● One test for each version of Spark ● Override Spark version in the tests workflowTestSuite("testWithSpark1_6_3") { addWorkflow('countByCountryFlow') { lookup('countByCountry') { set properties: [ ‘spark.version’: ‘1.6.3’ ] } } } workflowTestSuite("testWithSpark2_1_0") { addWorkflow('countByCountryFlow') { lookup('countByCountry') { set properties: [ ‘spark.version’: ‘2.1.0’ ] } } } Configuration override
  • 14. #EUde12 ● Complex pipeline with 10s of Jobs ● Most of them are Spark Jobs True story
  • 15. #EUde12 ● DataFrames API ● Rewrite all spark jobs to use DataFrames ● Is my new code ready for production?? ● Write tests ○ Assertions on output ● All tests succeed after changes (^_^)
  • 16. #EUde12 Data Validation and Assertion ● Types of checks ○ Record level checks ○ Aggregate level ■ Data transformation ■ Data aggregation ■ Data distribution ● Assert against Expectation Data validation and assertion
  • 17. #EUde12 Record level Validation us 148083 in 46074 cn 34332 br 30836 gb 24387 fr 14983 ... hadoop { workflowTestSuite(‘test1’) { addWorkflow(‘countFlow’,’testCountFlow’){} assertionWorkflow('assertNonNegativeCount') { sparkJob("assertNonNegativeCount") { } targets 'assertNonNegativeCount' } } val count = spark.read.avro(input) require(count. map(r => r.getAs[Long](“count”)). where(_ < 0 )).count() == 0) Record level validation
  • 18. #EUde12 Aggregated validation ● Binary spam classifier C1 ● C1 classifies 30% of test input as spam ● Wrote a new classifier C2 ● C2 can deviate at most 10% ● Write aggregated validations Aggregated validation
  • 20. #EUde12 Test execution $> gradle azkabanTest -Ptestname=test1 ● Test results on the terminal ● Reports for the passed and failed tests Tests [1/1]:=> Flow TEST-countByCountry completed with status SUCCEEDED Summary of the completed tests Job Statistics for individual test ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Flow Name | Latest Exec ID | Status | Running | Succeeded | Failed | Ready | Cancelled | Disabled | Total ------------------------------------------------------------------------------------------------------------------------------------------------------------------- TEST-countByCountry | 2997645 | SUCCEEDED | 0 | 2 | 0 | 0 | 0 | 0 | 2 Test execution
  • 21. #EUde12 Test automation ● Auto deployment process for Azkaban artifacts ● Tests run as part of the deployment ● Tests fail => Artifact deployment fails ● No un-tested code can go to production Test automation
  • 23. #EUde12 Test data ● Real data ○ Very large data ○ Tests run very slow ● Randomly generated data ○ Not equivalent to real data ○ Real issues can never be catched ● Manually created data ○ Covers all the cases ○ Too much effort ● Samples Test data
  • 24. #EUde12 Requirements of test data ● Representative of real data ● Smaller in size ● Automated generation ● Sharable ● Discoverable Requirements of sample data
  • 26. #EUde12 Flexible execution environment ● Sandboxed ● Replicate production settings ● Save and Restore ● Ability to run in a single box Flexible execution environment
  • 27. #EUde12 Data validation framework ● Validation logic in schema ● Automated validation on read and write ● Serves as contract between producers and consumers Data validation Framework