Faster Data Integration Pipeline Execution using Spark Job Server
Presenters: Sailee Jain and Prabhakar Gouda
Who are we?
▪ Sailee Jain
▪ Senior Software Engineer at Informatica
▪ ~6 years of experience working on various flavors of Data Engineering products
▪ LinkedIn - https://www.linkedin.com/in/saileejain/
▪ Prabhakar Gouda
▪ Senior Software Engineer at Informatica
▪ ~8 years of experience in the software industry
▪ LinkedIn - https://www.linkedin.com/in/prabhakarGouda/
Informatica
• Leading provider of Data Engineering solutions
• Broad portfolio of data management offerings
Agenda
▪ Informatica Big Data ETL
▪ Complex Data Types and Associated Challenges
▪ Data Preview Use-case
▪ Informatica Product Architecture
▪ Integrating Spark Job Server with Informatica
▪ Configure and Tune Spark Job Server
▪ Demo
▪ Q&A
Informatica ETL Pipeline
ETL pipelines
Dealing with buggy pipelines
▪ Where is the error?
▪ Is it due to a wrong choice of data types?
▪ Is it due to incorrect usage of a transformation? Which transformation?
▪ Check the data after each midstream transformation
Solution - Data Preview
Data Preview – Feature Requirements
▪ Ability to comprehend complex data types (e.g. map, struct, array)
▪ Support for a variety of data sources (Cassandra, HDFS, S3, Azure, etc.)
▪ Faster execution (trade off execution time against data size)
▪ Work with minimal changes to the existing codebase
▪ Support all existing Spark features and Informatica transformations
Spark-submit based Approach
What did spark-submit-based data preview achieve?
Features evaluated for supportability:
▪ Complex data types
▪ Support for a variety of data sources
▪ Faster execution
▪ Minimal changes to the existing codebase
▪ Support for existing Spark features / transformations
Execution Profiling Results - Spark-submit
▪ Validation and Optimization
▪ Translation
▪ Compilation
▪ Spark Execution
Alternatives for Faster Execution
Spark Job Server
▪ Provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts
▪ Well documented
▪ Active community
▪ Easy integration
▪ Suitable for low-latency queries
Interchangeably referred to as SJS or Job Server
Compare Spark-submit with Spark Job Server
▪ Spark context sharing across jobs
  ▪ Spark-submit: not supported; every job runs as a new YARN application
  ▪ Spark Job Server: allows context sharing across jobs
▪ Named object sharing across jobs (see the sketch below)
  ▪ Spark-submit: not supported
  ▪ Spark Job Server: allows RDD and DataFrame sharing
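A minimal sketch of what named-RDD sharing can look like with the Job Server, assuming the legacy spark.jobserver.SparkJob / NamedRddSupport API (newer releases also offer NamedObjectSupport for DataFrames); the class names, the "shared:words" key, and the input.string parameter are illustrative and not Informatica's implementation:

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// First job: computes an RDD and publishes it under a well-known name.
object CacheWordsJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
  override def runJob(sc: SparkContext, config: Config): Any = {
    val words = sc.parallelize(config.getString("input.string").split(" ").toSeq)
    namedRdds.update("shared:words", words) // visible to later jobs in the same context
    words.count()
  }
}

// Second job: runs later in the same shared context and reuses the published RDD.
object CountWordsJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
  override def runJob(sc: SparkContext, config: Config): Any =
    namedRdds.get[String]("shared:words").get.countByValue()
}
```

Both jobs must run against the same long-lived context; with spark-submit each run gets a fresh YARN application, so nothing survives between runs.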
Spark-submit based Architecture
SJS based Architecture
New Component: Spark Job Server
Execution Flow
Participants: Informatica Client, Informatica Server, Spark Job Server, Hadoop Cluster
1. Informatica Client: create the data pipeline and execute preview
2. Informatica Server: start the Spark Job Server
3. Spark Job Server: create the Spark context on the Hadoop cluster
4. Informatica Server: submit the Spark task for execution; it executes using the Spark context and the result is returned
5. Second data preview request: submit the Spark task for execution; it executes using the existing Spark context and the result is returned
6. If SJS is idle for a long time: delete the Spark context and stop the Spark Job Server
Spark Job Server vs Spark-submit
▪ Spark Job Server is on par with Spark-submit for the first run
▪ Subsequent runs are faster because of the shared Spark context
▪ Along with helping our customers, it helps developers (like us) get visual feedback on the data while handling production pipeline bugs, ensuring quicker RCA
Informatica Server Configuration
Cores: 2 x 12-core; Memory: 128 GB
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Operating System: Red Hat Enterprise Linux 7.0
Cluster Node Configuration
12 nodes, Cloudera 6.1
Cores: 24-core; Memory: 256 GB
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Operating System: Red Hat Enterprise Linux 7.0
Our Journey with Spark Job Server
Development Process
Setup Details
▪ Spark Job Server is configured on the same host as the Informatica Server
▪ Spark Job Server version - 0.9.0
▪ Spark deploy mode - YARN cluster
▪ Supported Hadoop distribution vendors
  ▪ EMR
  ▪ Cloudera
  ▪ Hortonworks
  ▪ Cloudera Data Platform
Getting started
▪ Get the SJS source code from git (https://github.com/spark-jobserver/spark-jobserver)
▪ Install a compatible version of sbt (defined in build.properties)
▪ Copy the following template files and edit as appropriate:
▪ local.sh.template: script template for setting the environment variables required to start SJS
▪ local.conf.template: Typesafe Config template file for defining the Job Server configuration
▪ Execute server_package.sh <env> to generate spark-job-server.jar
Environment Variables (local.sh.template)
▪ PIDFILE - Job Server process id file
▪ JOBSERVER_MEMORY - amount of memory (e.g. 512m, 2G) to give to the Job Server; defaults to 1G
▪ MAX_DIRECT_MEMORY - Job Server's value for the -XX:MaxDirectMemorySize option
▪ LOG_DIR - Job Server log directory
▪ MANAGER_EXTRA_SPARK_CONFS - extra Spark configurations
▪ SPARK_VERSION - Spark version
▪ SCALA_VERSION - Scala version, e.g. 2.11.8
▪ SPARK_HOME - SPARK_HOME directory on the Job Server machine
▪ YARN_CONF_DIR and HADOOP_CONF_DIR - directory containing all site XMLs (core, yarn, hbase, hive, etc.)
▪ SPARK_CONF_DIR - directory containing the Spark configuration files
Application Code Migration
Spark-submit:
▪ Create a jar containing the application logic
▪ Launch spark-submit with the application jar and the entry-point class
Spark Job Server:
▪ Modify the entry-point class to extend SparkSessionJob (see the word-count sketch below)
▪ Create a jar containing the application logic
▪ REST request to upload the application jar
▪ REST request to submit the application jar for execution
WordCount Example: Spark Standalone vs. Spark Job Server
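The original slide showed the two word-count implementations side by side as images. Below is a minimal sketch of the Spark Job Server side, modeled on the SparkSessionJob word-count example that ships with spark-jobserver; the package paths and the scalactic validation helpers (Good/Bad/One) are assumptions that may differ between Job Server versions:

```scala
import com.typesafe.config.Config
import org.apache.spark.sql.SparkSession
import org.scalactic._
import spark.jobserver.SparkSessionJob
import spark.jobserver.api.{JobEnvironment, SingleProblem, ValidationProblem}

import scala.util.Try

object WordCountSessionJob extends SparkSessionJob {
  // Typed input/output for the job: a list of words in, a word -> count map out.
  type JobData = Seq[String]
  type JobOutput = collection.Map[String, Long]

  // Runs inside the long-lived Spark context managed by the Job Server.
  def runJob(spark: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput =
    spark.sparkContext.parallelize(data).countByValue()

  // Called before runJob; turns the "input.string" request parameter into JobData
  // or reports a validation problem back to the REST caller.
  def validate(spark: SparkSession, runtime: JobEnvironment, config: Config):
      JobData Or Every[ValidationProblem] =
    Try(config.getString("input.string").split(" ").toSeq)
      .map(words => Good(words))
      .getOrElse(Bad(One(SingleProblem("No input.string config param"))))
}
```

The jar built from this class is then uploaded and executed through the REST calls shown under "Running Jobs" below.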
Running Jobs
1. Create spark-context (shared/per job)
curl -d "" "localhost:8090/contexts/test-context?num-cpu-cores=4&memory-per-node=512m"
OK⏎
2. Upload job binary
curl -X POST localhost:8090/binaries/test -H "Content-Type: application/java-archive" --data-binary @/<path.to.your.jar>
OK⏎
3. Submit job for execution
curl -d "input.string = a b c a b see" "localhost:8090/jobs?appName=test&classPath=<entryClass>&context=test-context&sync=true"
With sync=true the call blocks and the response body carries the job result.
Useful REST APIs
▪ /data
▪ POST /data/<prefix> - Uploads a new file
▪ /binaries
▪ POST /binaries/<appName> - upload a new binary file
▪ DELETE /binaries/<appName> - delete defined binary
▪ /jobs
▪ POST /jobs - Starts a new job, use ?sync=true to wait for results
▪ GET /jobs/<jobId> - Gets the result or status of a specific job
▪ DELETE /jobs/<jobId> - Kills the specified job
▪ /contexts
▪ POST /contexts/<name> - creates a new context
▪ DELETE /contexts/<name> - stops a context and all jobs running in it.
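These endpoints can also be called programmatically instead of via curl. Below is a minimal sketch of submitting a synchronous job from JVM code using the standard Java 11 HTTP client; the appName, classPath, and context values refer to the hypothetical word-count job above and are placeholders:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object SubmitPreviewJob {
  def main(args: Array[String]): Unit = {
    val client = HttpClient.newHttpClient()
    // POST /jobs with sync=true blocks until the job finishes and returns its result as JSON.
    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:8090/jobs?appName=test" +
        "&classPath=WordCountSessionJob&context=test-context&sync=true"))
      .POST(HttpRequest.BodyPublishers.ofString("input.string = a b c a b see"))
      .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(response.statusCode()) // e.g. 200 on success
    println(response.body())      // JSON document containing the job result
  }
}
```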
Challenges
Handling Job Dependencies
▪ Traditionally spark-submit provides the --files, --archives and --jars options to localize resources on the cluster nodes
▪ The equivalent properties in the Spark conf are:
--files = spark.yarn.dist.files
--jars = spark.yarn.dist.jars
--archives = spark.yarn.dist.archives
▪ These are honored only once, at Spark context creation time
▪ Job-specific jars can be provided using the dependent-jar-uris or cp context configuration param
Multiple Spark Job Servers
▪ One Job Server instance can execute jobs on only one Hadoop cluster
▪ To run multiple Job Server instances on the same host, configure distinct values for the following ports:
▪ JMX port - monitoring
▪ HTTP port - Job Server HTTP port
▪ H2DB port - required only if you are using H2DB for metadata management
Concurrency
▪ Maximum jobs per context
▪ spark.jobserver.max-jobs-per-context = <concurrencyCount>
▪ If not set, defaults to the number of cores on the machine where the Job Server is running
▪ Spark task-level concurrency
▪ Too few partitions – cannot utilize all cores available in the cluster
▪ Too many partitions – excessive overhead in managing many small tasks
▪ Rule: partition count = (input data size) / (size per partition)
For example, with 5120 MB of source data and 128 MB HDFS partitions, 5120 / 128 = 40:
--conf spark.sql.shuffle.partitions = 40
--conf spark.default.parallelism = 40
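A tiny sketch of the same sizing rule in code, using the example numbers above. Note that spark.sql.shuffle.partitions is a runtime SQL conf, so a job running in a shared Job Server context can still adjust it per job, whereas spark.default.parallelism is fixed when the Spark context is created; the helper below is illustrative only:

```scala
import org.apache.spark.sql.SparkSession

// Derive the partition count from input size and apply it to the current session.
def applyShufflePartitions(spark: SparkSession, inputSizeMb: Long, partitionSizeMb: Long = 128): Int = {
  val partitionCount = math.max(1, (inputSizeMb / partitionSizeMb).toInt) // e.g. 5120 / 128 = 40
  // A runtime SQL conf: can be tuned per job even inside a long-lived shared context.
  spark.conf.set("spark.sql.shuffle.partitions", partitionCount.toString)
  partitionCount
}
```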
Dependency conflicts
▪ The Spark Job Server assembly is an uber jar
▪ Adding uber jars to your classpath can result in version conflicts
▪ Solutions
1. spark.driver.userClassPathFirst and spark.executor.userClassPathFirst
2. Sync dependency versions in spark-jobserver/project/Dependencies.scala (see the sketch below)
3. Jar shading - modify the assembly in spark-jobserver/project/Assembly.scala
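For option 2, the Job Server's dependency versions are declared as ordinary sbt module definitions. The excerpt below is a hypothetical illustration only (the real Dependencies.scala is structured differently, and the Jackson artifacts and version are assumed), showing the idea of pinning a conflicting library to the version your application jar already uses before rebuilding the assembly:

```scala
// Hypothetical excerpt in the style of spark-jobserver/project/Dependencies.scala
import sbt._

object Dependencies {
  // Assumed version purely for illustration; align it with your application's dependency tree.
  val jacksonVersion = "2.9.9"

  lazy val jacksonDeps = Seq(
    "com.fasterxml.jackson.core" % "jackson-core"     % jacksonVersion,
    "com.fasterxml.jackson.core" % "jackson-databind" % jacksonVersion
  )
}
```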
Support for Kerberos
▪ Using a Kerberos principal and keytab
▪ Add the following properties to the Spark configuration file:
▪ spark.yarn.principal: user Kerberos principal
▪ spark.yarn.keytab: keytab file location on the Job Server host
▪ The Spark context is started as the Job Server user
▪ Using an impersonation user
▪ Generate the Kerberos token
▪ Add spark.yarn.proxy-user=<ImpersonationUser> in the Spark configuration file
▪ The Spark context will be started as the impersonation user
HTTPS/SSL Enabled Server
▪ Add the following properties to the Job Server configuration file:
▪ spray.can.server.keystore
▪ spray.can.server.keystorePW
▪ spray.can.server.ssl-encryption
Logging
▪ Adding the job id makes debugging easier
▪ Log format and logger level can be controlled from log4j.properties, e.g.:
log4j.appender.console.layout.ConversionPattern=[%d] %-5p %.26c [%X{jobId}] - %m%n
Default log file locations:
▪ Logs from server_start.sh: $LOG_DIR/server_start.log
▪ Spark Job Server logs: $LOG_DIR/log/spark-job-server.log
▪ Spark context logs: $LOG_DIR/log/<uniqueId>/spark-job-server.out
Key Takeaways & Recommendations
Key Takeaways
▪ Increase timeouts
▪ Important consideration for YARN cluster mode
▪ Remote clusters introduce network delays, which can cause failures due to timeouts
▪ Class-path issues
▪ A long-running application means a long-lived classpath – resources, once added, are present for the entire life of the context
▪ Use unique package names to distinguish between applications
▪ Resource/memory configs become static per job
▪ With a long-running Spark context, resource configuration can only be done at Spark context creation time
▪ Anticipate the load when creating the Spark context
▪ Executor keep-alive can enhance performance
▪ Depending on the usage pattern – if you have a steady load, keeping the executors alive can enhance performance
▪ Consider removing uploaded binaries and data at regular intervals
Timeouts (in local.conf.template)
▪ spark.context-settings.context-init-timeout (default 60s): timeout for the SupervisorActor to wait for forked (separate JVM) contexts to initialize
▪ spark.context-settings.forked-jvm-init-timeout (default 30s): timeout for a forked JVM to spin up and acquire resources
▪ spark.jobserver.short-timeout (default 3s): the ask-pattern timeout for the API
▪ spark.jobserver.yarn-context-creation-timeout (default 40s): in YARN deployments, how long the Job Server waits while creating contexts
▪ spark.jobserver.yarn-context-deletion-timeout (default 40s): in YARN deployments, how long the Job Server waits while deleting contexts
▪ spray.can.server.idle-timeout (default 60s): Spray-can HTTP server idle timeout
▪ spray.can.server.request-timeout (default 40s): Spray-can HTTP server request timeout; idle-timeout should always be greater than request-timeout
Complex Data Representation in Informatica Developer Tool
▪ Struct
▪ Array
▪ Map
▪ Primitives
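The slide shows these types in the Informatica Developer tool UI. For orientation, here is a small illustrative Spark schema mixing the same kinds of types (the field names are made up); this is how such a previewed record might look on the Spark side, not Informatica's internal representation:

```scala
import org.apache.spark.sql.types._

// Illustrative schema combining primitives with struct, array, and map fields.
val previewSchema = StructType(Seq(
  StructField("id", IntegerType),                              // primitive
  StructField("address", StructType(Seq(                       // struct
    StructField("city", StringType),
    StructField("zip", StringType)
  ))),
  StructField("phones", ArrayType(StringType)),                // array
  StructField("attributes", MapType(StringType, StringType))   // map
))
```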
Monitoring: Binaries
▪ Possible to add/remove binaries
▪ Upload a job binary and execute the job at any time
(UI screenshot callout: uploaded jar)
Monitoring: Spark Context
▪ Lists the running Spark contexts
▪ Spark contexts can be stopped from the UI
(UI screenshot callouts: Spark context name, Spark History Server URL, kill job)
Monitoring: Jobs
http://<Job Server host>:<port>/
Monitoring: Yarn Job
▪ Long-running Spark context
▪ Impersonation username
▪ Execution status
(UI screenshot callouts: YARN application name, impersonation user)
Demo
Q&A
Feedback
Your feedback is important to us. Don't forget to rate and review the sessions.