As you may already know, the open-source Spark Job Server offers a powerful platform for managing Spark jobs, jars, and contexts, turning Spark into a much more convenient and easy-to-use service. Spark Job Server can keep a Spark context warmed up and readily available to accept new jobs. At Informatica, we are leveraging Spark Job Server to solve the data-visualization use case.
Faster Data Integration Pipeline Execution using Spark-Jobserver
1. Faster Data Integration Pipeline Execution using Spark Job Server
Presenters:
Sailee Jain and Prabhakar Gouda
3. Who are we?
▪ Sailee Jain
▪ Senior Software Engineer at Informatica
▪ ~6 years of experience working on various flavors of Data Engineering products
▪ LinkedIn - https://www.linkedin.com/in/saileejain/
▪ Prabhakar Gouda
▪ Senior Software Engineer at Informatica
▪ ~8 years of experience in the software industry
▪ LinkedIn - https://www.linkedin.com/in/prabhakarGouda/
5. Agenda
▪ Informatica Big Data ETL
▪ Complex Data Types and Associated Challenges
▪ Data Preview Use-case
▪ Informatica Product Architecture
▪ Integrating Spark Job Server with Informatica
▪ Configure and Tune Spark Job Server
▪ Demo
▪ Q&A
8. Dealing with buggy pipelines
▪ Where is the error?
▪ Is it due to a wrong choice of data types?
▪ Is it due to incorrect usage of a transformation?
▪ Which transformation?
▪ Check the data after each midstream transformation
10. Data Preview – Feature Requirements
▪ Ability to comprehend complex data types (e.g. map, struct, array)
▪ Support a variety of data sources (Cassandra, HDFS, S3, Azure, etc.)
▪ Faster execution (trade off execution time against data size)
▪ Work with minimal changes to the existing codebase
▪ Support all existing Spark features and Informatica transformations
12. What did spark-submit-based data preview achieve?
Features evaluated for supportability:
▪ Complex data types
▪ Support for a variety of data sources
▪ Faster execution
▪ Minimal changes to the existing codebase
▪ Support for existing Spark features/transformations
15. Spark Job Server
▪ Provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts
▪ Well documented
▪ Active community
▪ Easy integration
▪ Suitable for low-latency queries
Interchangeably referred to as SJS or Job Server
16. Compare Spark-submit with Spark Job Server
Metric | Spark-submit | Spark Job Server
Spark context sharing across jobs | Not supported; every job runs as a new YARN application | Allows context sharing across jobs
Named object sharing across jobs | Not supported | Allows RDD and DataFrame sharing
19. Execution Flow
Actors: Informatica Client → Informatica Server → Spark Job Server → Hadoop Cluster
1. Create data pipeline (Informatica Client)
2. Execute preview (Informatica Server)
3. Start Spark Job Server
4. Create Spark context (on the Hadoop cluster)
5. Submit Spark task for execution
6. Execute using the Spark context
7. Return result
Second data preview request:
8. Submit Spark task for execution
9. Execute using the existing Spark context
10. Return result
If SJS is idle for a long time:
11. Delete Spark context
12. Stop Spark Job Server
20. Spark Job Server vs Spark-submit
▪ Spark Job Server is on par with spark-submit for the first run
▪ Subsequent runs are faster because of the shared Spark context
▪ Along with helping our customers, it helps developers (like us) get visual feedback on data while handling production pipeline bugs, ensuring quicker RCA
Informatica Server Configuration
Cores: 2 x 12-core, Memory: 128 GB
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Operating System: Red Hat Enterprise Linux 7.0
Cluster Node Configuration
12 nodes, Cloudera 6.1
Cores: 24-core, Memory: 256 GB
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Operating System: Red Hat Enterprise Linux 7.0
22. Setup Details
▪ Spark Job Server is configured on the same host as the Informatica Server
▪ Spark Job Server version - 0.9.0
▪ Spark deploy mode - YARN cluster
▪ Supported Hadoop distribution vendors:
▪ EMR
▪ Cloudera
▪ Hortonworks
▪ Cloudera Data Platform
23. Getting started
▪ Get the SJS source code from git (https://github.com/spark-jobserver/spark-jobserver)
▪ Install a compatible version of sbt (defined in build.properties)
▪ Create copies of the following template files and edit as appropriate:
▪ local.sh.template: script template for setting the environment variables required to start SJS
▪ local.conf.template: Typesafe Config template file for defining the Job Server configuration
▪ Execute server_package.sh <env> to generate spark-job-server.jar
24. Environment Variables (local.sh.template)
Environment variable | Purpose
PIDFILE | Job Server process-id file
JOBSERVER_MEMORY | Amount of memory (e.g. 512m, 2G) to give to the Job Server; defaults to 1G
MAX_DIRECT_MEMORY | Job Server's value for the -XX:MaxDirectMemorySize option
LOG_DIR | Job Server log directory
MANAGER_EXTRA_SPARK_CONFS | Extra Spark configurations
SPARK_VERSION | Spark version
SCALA_VERSION | Scala version, e.g. 2.11.8
SPARK_HOME | SPARK_HOME directory on the Job Server machine
YARN_CONF_DIR and HADOOP_CONF_DIR | Directory containing all site XMLs (core, yarn, hbase, hive, etc.)
SPARK_CONF_DIR | Directory containing the Spark configuration files
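Put together, a minimal local.sh derived from the table above might look like the following config fragment. Every value here is an illustrative placeholder, not a recommendation; adjust each one to your environment.

```sh
# Illustrative local.sh fragment -- all values are placeholders
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=2G                      # heap for the Job Server JVM (defaults to 1G)
LOG_DIR=/var/log/spark-jobserver
SPARK_VERSION=2.4.4
SCALA_VERSION=2.11.8
SPARK_HOME=/opt/spark
YARN_CONF_DIR=/etc/hadoop/conf           # site XMLs (core, yarn, hive, ...)
HADOOP_CONF_DIR=/etc/hadoop/conf
SPARK_CONF_DIR=$SPARK_HOME/conf
```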
25. Application Code Migration
Ø Create a Jar containing application logic
Ø Launch a spark-submit with the application jar
and entry-point class
Ø Modify the entry point class to extend from
SparkSessionJob
Ø Create a Jar containing application logic.
Ø REST request to upload the application jar
Ø REST request to submit the application jar for
execution
Spark Job ServerSpark-submit
27. Running Jobs
1. Create a Spark context (shared/per job)
curl -d "" "localhost:8090/contexts/test-context?num-cpu-cores=4&memory-per-node=512m"
OK
2. Upload the job binary
curl -X POST localhost:8090/binaries/test -H "Content-Type: application/java-archive" --data-binary @/<path.to.your.jar>
OK
3. Submit the job for execution
curl -d "input.string = a b c a b see" "localhost:8090/jobs?appName=test&classPath=<entryClass>&context=test-context&sync=true"
28. Useful REST APIs
▪ /data
▪ POST /data/<prefix> - Uploads a new file
▪ /binaries
▪ POST /binaries/<appName> - Uploads a new binary file
▪ DELETE /binaries/<appName> - Deletes the specified binary
▪ /jobs
▪ POST /jobs - Starts a new job; use ?sync=true to wait for results
▪ GET /jobs/<jobId> - Gets the result or status of a specific job
▪ DELETE /jobs/<jobId> - Kills the specified job
▪ /contexts
▪ POST /contexts/<name> - Creates a new context
▪ DELETE /contexts/<name> - Stops a context and all jobs running in it
30. Handling Job Dependencies
▪ Traditionally, spark-submit provides the --files, --archives and --jars options to localize resources on the cluster nodes
▪ The equivalent properties in the Spark configuration are:
--files = spark.yarn.dist.files
--jars = spark.yarn.dist.jars
--archives = spark.yarn.dist.archives
▪ Honored only once, at the time of Spark context creation
▪ Job-specific jars can be provided using the dependent-jar-uris or cp context configuration parameters
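As a sketch, the context-level equivalent could look like the following local.conf fragment. The jar path is a made-up example, and whether dependent-jar-uris belongs in the config file or as a query parameter at context creation should be checked against your Job Server version.

```conf
spark {
  context-settings {
    # job-specific jars, localized when the context is created (path is illustrative)
    dependent-jar-uris = ["file:///opt/informatica/libs/my-udfs.jar"]
  }
}
```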
31. Multiple Spark Job Servers
▪ One Job Server instance can execute jobs on only one Hadoop cluster
▪ To run multiple Job Server instances on the same host, configure the following ports:
▪ JMX port - monitoring
▪ HTTP port - Job Server HTTP port
▪ H2 DB port - required only if you are using H2 DB for metadata management
32. Concurrency
▪ Maximum jobs per context
▪ spark.jobserver.max-jobs-per-context = <concurrencyCount>
▪ If not set, defaults to the number of cores on the machine where the Job Server is running
▪ Spark task-level concurrency
▪ Too few partitions - cannot utilize all cores available in the cluster
▪ Too many partitions - excessive overhead in managing many small tasks
▪ Rule: partition count = (input data size) / (size per partition)
For example, with 5120 MB of source data and a 128 MB HDFS partition size, set 5120 / 128 = 40:
--conf spark.sql.shuffle.partitions=40
--conf spark.default.parallelism=40
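The rule of thumb above can be checked with plain shell arithmetic; the sizes below are the ones from the slide's example.

```shell
# Partition count = input data size / size per partition (both in MB)
DATA_SIZE_MB=5120
PARTITION_SIZE_MB=128
PARTITIONS=$((DATA_SIZE_MB / PARTITION_SIZE_MB))
# Emit the matching spark-submit flags
echo "--conf spark.sql.shuffle.partitions=$PARTITIONS"
echo "--conf spark.default.parallelism=$PARTITIONS"
```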
33. Dependency conflicts
▪ Spark Job Server is packaged as an uber jar
▪ Adding uber jars to your classpath can result in version conflicts
▪ Solutions:
1. Set spark.driver.userClassPathFirst and spark.executor.userClassPathFirst
2. Sync dependency versions in spark-jobserver/project/Dependencies.scala
3. Jar shading - modify the assembly in spark-jobserver/project/Assembly.scala
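Solution 1 uses two standard Spark properties; as a sketch, they can be set in the Spark configuration used for the context:

```conf
# Prefer user-supplied classes over those bundled in the uber jar
spark.driver.userClassPathFirst=true
spark.executor.userClassPathFirst=true
```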
34. Support for Kerberos
▪ Using a Kerberos principal and keytab
▪ Add the following properties to the Spark configuration file:
▪ spark.yarn.principal: user Kerberos principal
▪ spark.yarn.keytab: keytab file location on the Job Server host
▪ The Spark context is started as the Job Server user
▪ Using an impersonation user
▪ Generate the Kerberos token
▪ Add spark.yarn.proxy-user=<ImpersonationUser> to the Spark configuration file
▪ The Spark context will be started as the impersonation user
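Putting the keytab-based option together, a minimal Spark configuration sketch; the principal name and keytab path are illustrative placeholders:

```conf
# Keytab-based Kerberos login; the context runs as the Job Server user
spark.yarn.principal=svc-jobserver@EXAMPLE.COM
spark.yarn.keytab=/etc/security/keytabs/svc-jobserver.keytab
```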
35. HTTPS/SSL-Enabled Server
▪ Add the following properties to the Job Server configuration file:
▪ spray.can.server.keystore
▪ spray.can.server.keystorePW
▪ spray.can.server.ssl-encryption
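A possible Job Server configuration fragment wiring the three properties together; the keystore path and password are placeholders:

```conf
spray.can.server {
  ssl-encryption = on
  keystore = "/opt/jobserver/certs/keystore.jks"
  keystorePW = "changeit"
}
```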
36. Logging
▪ Adding the Job-Id to the log pattern makes debugging easier
▪ Log format and logger level can be controlled from log4j.properties
log4j.appender.console.layout.ConversionPattern=[%d] %-5p %.26c [%X{jobId}] - %m%n
Purpose | Default log file name
Logs from server_start.sh | $LOG_DIR/server_start.log
Spark Job Server logs | $LOG_DIR/log/spark-job-server.log
Spark context logs | $LOG_DIR/log/<uniqueId>/spark-job-server.out
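A minimal log4j.properties sketch showing where the conversion pattern above plugs in; the appender name console is assumed to match the property shown on the slide:

```conf
# Console appender whose layout includes the jobId MDC key
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=[%d] %-5p %.26c [%X{jobId}] - %m%n
```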
38. Key Takeaways
▪ Increase timeouts
▪ Important consideration for YARN cluster mode
▪ Remote clusters introduce network delays, which can cause failures due to timeouts
▪ Class-path issues
▪ A long-running application means a long-lived classpath - resources, once added, are present for the entire life of the context
▪ Use unique package names to distinguish between applications
▪ Resource/memory configs become static per job
▪ With a long-running Spark context, resource configurations can only be set at Spark context creation
▪ Anticipate the load when creating the Spark context
▪ Keeping executors alive can enhance performance
▪ Depending on the usage pattern - if you have a steady load, keeping the executors alive can enhance performance
▪ Consider removing uploaded binaries and data at regular intervals
39. Timeouts (in local.conf.template)
Property | Default | Description
spark.context-settings.context-init-timeout | 60s | Timeout for the SupervisorActor to wait for forked (separate-JVM) contexts to initialize
spark.context-settings.forked-jvm-init-timeout | 30s | Timeout for a forked JVM to spin up and acquire resources
spark.jobserver.short-timeout | 3s | The ask-pattern timeout for the API
spark.jobserver.yarn-context-creation-timeout | 40s | In YARN deployment, timeout for the Job Server to wait while creating contexts
spark.jobserver.yarn-context-deletion-timeout | 40s | In YARN deployment, timeout for the Job Server to wait while deleting contexts
spray.can.server.idle-timeout | 60s | Spray-can HTTP server idle timeout
spray.can.server.request-timeout | 40s | Spray-can HTTP server request timeout; idle-timeout should always be greater than request-timeout
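For a remote cluster, the "increase timeouts" takeaway could translate into a local.conf fragment like this. The values are illustrative only; note that request-timeout is kept below idle-timeout, as the table requires.

```conf
# Hypothetical raised timeouts for a remote YARN cluster -- values are examples
spark {
  context-settings.context-init-timeout = 120s
  context-settings.forked-jvm-init-timeout = 60s
  jobserver.yarn-context-creation-timeout = 120s
  jobserver.yarn-context-deletion-timeout = 120s
}
spray.can.server {
  idle-timeout = 180s
  request-timeout = 120s   # must stay below idle-timeout
}
```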