SlideShare une entreprise Scribd logo
1  sur  22
Télécharger pour lire hors ligne
Dependency Injection in
Apache Spark Applications
Sandeep Prabhakar & Shu Das
Overview
● Signals overview
● How we use Spark in the Signals team at Salesforce
● An example of a simple Spark application
● Injecting dependencies to a Spark application
● Pitfalls of Dependency Injection in Spark and how to overcome them
Signals: a platform for making sense of activity streams
● A platform for extracting important
insights from a large volume of activity
● Activity includes emails, meetings, phone
calls, web clicks, news, SMS
urgent
email
suggested
follow-up
topic alert
pricing
mentioned
meeting
request
negative
sentiment
Spark in Signals team at Salesforce
● Spark Structured Streaming applications
● Applications process emails read from Kafka and S3
● Perform enrichment, aggregations on data over time windows
● Write to Kafka and Postgres
● Deployed on Apache Mesos
Apache
Kafka
AWS
S3
Apache
Kafka
Simple Spark application
Simple Spark application
val spark = SparkSession.builder.appName("Example").master("local").getOrCreate()
import spark.implicits._
// Create DataFrame representing the stream of input lines
val lines = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 9999)
.load()
// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))
// Generate running word count
val wordCounts = words.groupBy("value").count()
// Start running the query that saves word counts to redis
val query = wordCounts.writeStream
.foreach(new BasicRedisWriter)
.outputMode("update")
.start()
query.awaitTermination()
Desired functionalities
● Easily testable
● Components with single responsibility (separation of concerns)
● Ability to compose and reuse dependencies
● Ability to use different implementations of dependencies
● Configurable between dev, test, and prod environments
Using Guice to inject dependencies
Dependency Injection
● Technique where one object supplies the dependencies of another object
● A dependency is an object that can be used (a service)
● An injection is the passing of a dependency to a dependent object (a client)
that would use it
Summarized from Wikipedia: https://en.wikipedia.org/wiki/Dependency_injection
Guice
● DI framework from Google
● Easy to use and highly configurable
● Framework agnostic, implements JSR-330 (javax.inject)
Using Guice to inject dependencies
// Inject RedisClient
class GuiceRedisWriter @Inject()(redisClient: RedisClient) extends ForeachWriter[Row] {
...
}
// Inject the abstract ForeachWriter[Row]. Guice module will set the proper implementation
class GuiceExample @Inject()(writer: ForeachWriter[Row]) {
def countWords(spark: SparkSession, lines: DataFrame): StreamingQuery = { ... }
}
// Guice Module that provides implementations for dependencies
class GuiceExampleModule extends AbstractModule with ScalaModule {
@Provides @Singleton
def provideRedisClient(): RedisClient = new RedisClient("localhost", 6379)
@Provides @Singleton
def provideForeachWriter(redis: RedisClient): ForeachWriter[Row] = new GuiceRedisWriter(redis)
}
def main(args: Array[String]): Unit = {
// Create the injector and get instance of class
val injector = Guice.createInjector(new GuiceExampleModule)
val wordCounter = injector.getInstance(classOf[GuiceExample])
// Create Spark Session and Stream. Then call countWords of GuiceExample instance
}
Spark Serialization Exception
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
...
Caused by: java.io.NotSerializableException: com.redis.RedisClient
Serialization stack:
- object not serializable (class: com.redis.RedisClient, value: localhost:6379)
- field (class: com.salesforceiq.spark.examples.guice.GuiceRedisWriter, name: redisClient, type:
class com.redis.RedisClient)
- object (class com.salesforceiq.spark.examples.guice.GuiceRedisWriter,
com.salesforceiq.spark.examples.guice.GuiceRedisWriter@2aef90ef)
...
Why Guice fails on Spark
● Internally Spark ships tasks to executors
● These tasks should be serializable to be transmitted
● Guice instantiates dependencies on the Spark driver at application start
● Not all dependencies can be serialized
○ Dependencies with network connections (Redis Client, Postgres Client)
○ 3rd party libraries that are not in our control
● Spark tries to serialize these dependencies but fails
How to solve this
● Need to serialize non-serializable dependencies?!?!
How to solve this
● Need to serialize non-serializable dependencies?!?!
● Serialize the configuration for a dependency (aka Guice providers)
● Construct the instances of the dependencies on the executors
How to solve this
● Need to serialize non-serializable dependencies?!?!
● Serialize the configuration for a dependency (aka Guice providers)
● Construct the instances of the dependencies on the executors
● Basically, serialize the injector and inject dependencies on the executors
Injector Provider
Injector Provider
● An internal library written at SalesforceIQ (soon to be open sourced)
● A wrapper on Guice that creates a serializable injector
● Creates an injector that can lazily load modules and dependencies
Spark with Injector Provider
● Spark ships the serialized injector along with the task to the executors
● On task deserialization
○ The injector is deserialized
○ All the dependencies are injected
Using Injector Provider
class InjectorProviderExampleModule extends AbstractModule {
@Provides @Singleton
def provideForeachWriter(stub: ProvidedInjectorStub, redisClient: RedisClient): ForeachWriter[Row] = {
new InjectorProviderRedisWriter(stub, redisClient)
}
}
class InjectorProviderRedisWriter @Inject()(stub: ProvidedInjectorStub, _redisClient: RedisClient) extends
ForeachWriter[Row] {
// Make the RedisClient transient and Injectable so it does not get serialized by the JVM
@Inject @transient
private val redisClient = _redisClient
// Deserialize this object and then use stub to inject all members
private def readObject(in: ObjectInputStream): Unit = {
in.defaultReadObject()
stub.injectMembers(this)
}
...
}
// Extend abstract class which internally injects all @Inject annotated objects
class InjectorProviderExample @Inject() (writer: ForeachWriter[Row]) extends ProvidedInjector { /* Same */ }
def main(args: Array[String]): Unit = {
// Create the injector and get instance
val injector = InjectorProvider.builder().addBootstrapModuleTypes(classOf[InjectorProviderModule]).build()
val wordCounter = injector.getInstance(classOf[InjectorProviderExample])
// Create Spark Session and Stream. Then call countWords of GuiceExample instance
}
Tradeoffs of Injector Provider
● Not straightforward
● Tied to Guice as your DI framework
Conclusions
● Creating modular Spark jobs is not easy
● Dependency injection (DI) in Spark isn’t straightforward
● Spark tasks must be serializable
● Plain old Guice does not work on Spark
● Using an injector that is serializable makes DI possible in Spark
● Injector Provider (soon to be open sourced) gives the ability to build
serializable injectors

Contenu connexe

Tendances

Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeDatabricks
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsDatabricks
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
MongoDB WiredTiger Internals
MongoDB WiredTiger InternalsMongoDB WiredTiger Internals
MongoDB WiredTiger InternalsNorberto Leite
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, ConfluentTemporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, ConfluentHostedbyConfluent
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsFlink Forward
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System OverviewFlink Forward
 

Tendances (20)

Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale Platforms
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
MongoDB WiredTiger Internals
MongoDB WiredTiger InternalsMongoDB WiredTiger Internals
MongoDB WiredTiger Internals
 
Spark
SparkSpark
Spark
 
Angular 2 observables
Angular 2 observablesAngular 2 observables
Angular 2 observables
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, ConfluentTemporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
Temporal-Joins in Kafka Streams and ksqlDB | Matthias Sax, Confluent
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 

Similaire à Dependency Injection in Apache Spark Applications

Scal`a`ngular - Scala and Angular
Scal`a`ngular - Scala and AngularScal`a`ngular - Scala and Angular
Scal`a`ngular - Scala and AngularKnoldus Inc.
 
Divide and Conquer – Microservices with Node.js
Divide and Conquer – Microservices with Node.jsDivide and Conquer – Microservices with Node.js
Divide and Conquer – Microservices with Node.jsSebastian Springer
 
Angular for Java Enterprise Developers: Oracle Code One 2018
Angular for Java Enterprise Developers: Oracle Code One 2018Angular for Java Enterprise Developers: Oracle Code One 2018
Angular for Java Enterprise Developers: Oracle Code One 2018Loiane Groner
 
Java Spring MVC Framework with AngularJS by Google and HTML5
Java Spring MVC Framework with AngularJS by Google and HTML5Java Spring MVC Framework with AngularJS by Google and HTML5
Java Spring MVC Framework with AngularJS by Google and HTML5Tuna Tore
 
springmvc-150923124312-lva1-app6892
springmvc-150923124312-lva1-app6892springmvc-150923124312-lva1-app6892
springmvc-150923124312-lva1-app6892Tuna Tore
 
Effective JavaFX architecture with FxObjects
Effective JavaFX architecture with FxObjectsEffective JavaFX architecture with FxObjects
Effective JavaFX architecture with FxObjectsSrikanth Shenoy
 
GDSC NITS Presents Kickstart into ReactJS
GDSC NITS Presents Kickstart into ReactJSGDSC NITS Presents Kickstart into ReactJS
GDSC NITS Presents Kickstart into ReactJSPratik Majumdar
 
OpenDaylight and YANG
OpenDaylight and YANGOpenDaylight and YANG
OpenDaylight and YANGCoreStack
 
Go swagger tutorial how to create golang api documentation using go swagger (1)
Go swagger tutorial how to create golang api documentation using go swagger (1)Go swagger tutorial how to create golang api documentation using go swagger (1)
Go swagger tutorial how to create golang api documentation using go swagger (1)Katy Slemon
 
Meetup 2022 - APIs with Quarkus.pdf
Meetup 2022 - APIs with Quarkus.pdfMeetup 2022 - APIs with Quarkus.pdf
Meetup 2022 - APIs with Quarkus.pdfLuca Mattia Ferrari
 
Angular 2 Migration - JHipster Meetup 6
Angular 2 Migration - JHipster Meetup 6Angular 2 Migration - JHipster Meetup 6
Angular 2 Migration - JHipster Meetup 6William Marques
 
2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_onSri Ambati
 
Developing Microservices using Spring - Beginner's Guide
Developing Microservices using Spring - Beginner's GuideDeveloping Microservices using Spring - Beginner's Guide
Developing Microservices using Spring - Beginner's GuideMohanraj Thirumoorthy
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model ServingDatabricks
 
Kicking off with Zend Expressive and Doctrine ORM (ConFoo YVR 2017)
Kicking off with Zend Expressive and Doctrine ORM (ConFoo YVR 2017)Kicking off with Zend Expressive and Doctrine ORM (ConFoo YVR 2017)
Kicking off with Zend Expressive and Doctrine ORM (ConFoo YVR 2017)James Titcumb
 
How and why i roll my own node.js framework
How and why i roll my own node.js frameworkHow and why i roll my own node.js framework
How and why i roll my own node.js frameworkBen Lin
 

Similaire à Dependency Injection in Apache Spark Applications (20)

Scal`a`ngular - Scala and Angular
Scal`a`ngular - Scala and AngularScal`a`ngular - Scala and Angular
Scal`a`ngular - Scala and Angular
 
Divide and Conquer – Microservices with Node.js
Divide and Conquer – Microservices with Node.jsDivide and Conquer – Microservices with Node.js
Divide and Conquer – Microservices with Node.js
 
Angular for Java Enterprise Developers: Oracle Code One 2018
Angular for Java Enterprise Developers: Oracle Code One 2018Angular for Java Enterprise Developers: Oracle Code One 2018
Angular for Java Enterprise Developers: Oracle Code One 2018
 
Liferay (DXP) 7 Tech Meetup for Developers
Liferay (DXP) 7 Tech Meetup for DevelopersLiferay (DXP) 7 Tech Meetup for Developers
Liferay (DXP) 7 Tech Meetup for Developers
 
Java Spring MVC Framework with AngularJS by Google and HTML5
Java Spring MVC Framework with AngularJS by Google and HTML5Java Spring MVC Framework with AngularJS by Google and HTML5
Java Spring MVC Framework with AngularJS by Google and HTML5
 
springmvc-150923124312-lva1-app6892
springmvc-150923124312-lva1-app6892springmvc-150923124312-lva1-app6892
springmvc-150923124312-lva1-app6892
 
Effective JavaFX architecture with FxObjects
Effective JavaFX architecture with FxObjectsEffective JavaFX architecture with FxObjects
Effective JavaFX architecture with FxObjects
 
GDSC NITS Presents Kickstart into ReactJS
GDSC NITS Presents Kickstart into ReactJSGDSC NITS Presents Kickstart into ReactJS
GDSC NITS Presents Kickstart into ReactJS
 
OpenDaylight and YANG
OpenDaylight and YANGOpenDaylight and YANG
OpenDaylight and YANG
 
Express node js
Express node jsExpress node js
Express node js
 
Go swagger tutorial how to create golang api documentation using go swagger (1)
Go swagger tutorial how to create golang api documentation using go swagger (1)Go swagger tutorial how to create golang api documentation using go swagger (1)
Go swagger tutorial how to create golang api documentation using go swagger (1)
 
Meetup 2022 - APIs with Quarkus.pdf
Meetup 2022 - APIs with Quarkus.pdfMeetup 2022 - APIs with Quarkus.pdf
Meetup 2022 - APIs with Quarkus.pdf
 
Angular 2 Migration - JHipster Meetup 6
Angular 2 Migration - JHipster Meetup 6Angular 2 Migration - JHipster Meetup 6
Angular 2 Migration - JHipster Meetup 6
 
Explaination of angular
Explaination of angularExplaination of angular
Explaination of angular
 
2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on2014 09 30_sparkling_water_hands_on
2014 09 30_sparkling_water_hands_on
 
Developing Microservices using Spring - Beginner's Guide
Developing Microservices using Spring - Beginner's GuideDeveloping Microservices using Spring - Beginner's Guide
Developing Microservices using Spring - Beginner's Guide
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model Serving
 
LoopbackJS the intro
LoopbackJS the introLoopbackJS the intro
LoopbackJS the intro
 
Kicking off with Zend Expressive and Doctrine ORM (ConFoo YVR 2017)
Kicking off with Zend Expressive and Doctrine ORM (ConFoo YVR 2017)Kicking off with Zend Expressive and Doctrine ORM (ConFoo YVR 2017)
Kicking off with Zend Expressive and Doctrine ORM (ConFoo YVR 2017)
 
How and why i roll my own node.js framework
How and why i roll my own node.js frameworkHow and why i roll my own node.js framework
How and why i roll my own node.js framework
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 

Dernier (20)

Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 

Dependency Injection in Apache Spark Applications

  • 1. Dependency Injection in Apache Spark Applications Sandeep Prabhakar & Shu Das
  • 2. Overview ● Signals overview ● How we use Spark in the Signals team at Salesforce ● An example of a simple Spark application ● Injecting dependencies to a Spark application ● Pitfalls of Dependency Injection in Spark and how to overcome them
  • 3. Signals: a platform for making sense of activity streams ● A platform for extracting important insights from a large volume of activity ● Activity includes emails, meetings, phone calls, web clicks, news, SMS urgent email suggested follow-up topic alert pricing mentioned meeting request negative sentiment
  • 4. Spark in Signals team at Salesforce ● Spark Structured Streaming applications ● Applications process emails read from Kafka and S3 ● Perform enrichment, aggregations on data over time windows ● Write to Kafka and Postgres ● Deployed on Apache Mesos Apache Kafka AWS S3 Apache Kafka
  • 6. Simple Spark application val spark = SparkSession.builder.appName("Example").master("local").getOrCreate() import spark.implicits._ // Create DataFrame representing the stream of input lines val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load() // Split the lines into words val words = lines.as[String].flatMap(_.split(" ")) // Generate running word count val wordCounts = words.groupBy("value").count() // Start running the query that saves word counts to redis val query = wordCounts.writeStream .foreach(new BasicRedisWriter) .outputMode("update") .start() query.awaitTermination()
  • 7. Desired functionalities ● Easily testable ● Components with single responsibility (separation of concerns) ● Ability to compose and reuse dependencies ● Ability to use different implementations of dependencies ● Configurable between dev, test, and prod environments
  • 8. Using Guice to inject dependencies
  • 9. Dependency Injection ● Technique where one object supplies the dependencies of another object ● A dependency is an object that can be used (a service) ● An injection is the passing of a dependency to a dependent object (a client) that would use it Summarized from Wikipedia: https://en.wikipedia.org/wiki/Dependency_injection
  • 10. Guice ● DI framework from Google ● Easy to use and highly configurable ● Framework agnostic, implements JSR-330 (javax.inject)
  • 11. Using Guice to inject dependencies // Inject RedisClient class GuiceRedisWriter @Inject()(redisClient: RedisClient) extends ForeachWriter[Row] { ... } // Inject the abstract ForeachWriter[Row]. Guice module will set the proper implementation class GuiceExample @Inject()(writer: ForeachWriter[Row]) { def countWords(spark: SparkSession, lines: DataFrame): StreamingQuery = { ... } } // Guice Module that provides implementations for dependencies class GuiceExampleModule extends AbstractModule with ScalaModule { @Provides @Singleton def provideRedisClient(): RedisClient = new RedisClient("localhost", 6379) @Provides @Singleton def provideForeachWriter(redis: RedisClient): ForeachWriter[Row] = new GuiceRedisWriter(redis) } def main(args: Array[String]): Unit = { // Create the injector and get instance of class val injector = Guice.createInjector(new GuiceExampleModule) val wordCounter = injector.getInstance(classOf[GuiceExample]) // Create Spark Session and Stream. Then call countWords of GuiceExample instance }
  • 12. Spark Serialization Exception org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) ... Caused by: java.io.NotSerializableException: com.redis.RedisClient Serialization stack: - object not serializable (class: com.redis.RedisClient, value: localhost:6379) - field (class: com.salesforceiq.spark.examples.guice.GuiceRedisWriter, name: redisClient, type: class com.redis.RedisClient) - object (class com.salesforceiq.spark.examples.guice.GuiceRedisWriter, com.salesforceiq.spark.examples.guice.GuiceRedisWriter@2aef90ef) ...
  • 13. Why Guice fails on Spark ● Internally Spark ships tasks to executors ● These tasks should be serializable to be transmitted ● Guice instantiates dependencies on the Spark driver at application start ● Not all dependencies can be serialized ○ Dependencies with network connections (Redis Client, Postgres Client) ○ 3rd party libraries that are not in our control ● Spark tries to serialize these dependencies but fails
  • 14. How to solve this ● Need to serialize non-serializable dependencies?!?!
  • 15. How to solve this ● Need to serialize non-serializable dependencies?!?! ● Serialize the configuration for a dependency (aka Guice providers) ● Construct the instances of the dependencies on the executors
  • 16. How to solve this ● Need to serialize non-serializable dependencies?!?! ● Serialize the configuration for a dependency (aka Guice providers) ● Construct the instances of the dependencies on the executors ● Basically, serialize the injector and inject dependencies on the executors
  • 18. Injector Provider ● An internal library written at SalesforceIQ (soon to be open sourced) ● A wrapper on Guice that creates a serializable injector ● Creates an injector that can lazily load modules and dependencies
  • 19. Spark with Injector Provider ● Spark ships the serialized injector along with the task to the executors ● On task deserialization ○ The injector is deserialized ○ All the dependencies are injected
  • 20. Using Injector Provider class InjectorProviderExampleModule extends AbstractModule { @Provides @Singleton def provideForeachWriter(stub: ProvidedInjectorStub, redisClient: RedisClient): ForeachWriter[Row] = { new InjectorProviderRedisWriter(stub, redisClient) } } class InjectorProviderRedisWriter @Inject()(stub: ProvidedInjectorStub, _redisClient: RedisClient) extends ForeachWriter[Row] { // Make the RedisClient transient and Injectable so it does not get serialized by the JVM @Inject @transient private val redisClient = _redisClient // Deserialize this object and then use stub to inject all members private def readObject(in: ObjectInputStream): Unit = { in.defaultReadObject() stub.injectMembers(this) } ... } // Extend abstract class which internally injects all @Inject annotated objects class InjectorProviderExample @Inject() (writer: ForeachWriter[Row]) extends ProvidedInjector { /* Same */ } def main(args: Array[String]): Unit = { // Create the injector and get instance val injector = InjectorProvider.builder().addBootstrapModuleTypes(classOf[InjectorProviderModule]).build() val wordCounter = injector.getInstance(classOf[InjectorProviderExample]) // Create Spark Session and Stream. Then call countWords of GuiceExample instance }
  • 21. Tradeoffs of Injector Provider ● Not straightforward ● Tied to Guice as your DI framework
  • 22. Conclusions ● Creating modular Spark jobs is not easy ● Dependency injection (DI) in Spark isn’t straightforward ● Spark tasks must be serializable ● Plain old Guice does not work on Spark ● Using an injector that is serializable makes DI possible in Spark ● Injector Provider (soon to be open sourced) gives the ability to build serializable injectors