We present multiple anti-patterns that use Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented here are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which exposes Redis as a DataFrame backing store or as an upstream source for Structured Streaming. We deviate from these common use cases to explore where Redis can plug gaps while scaling out high-throughput Spark applications.
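For reference, the common spark-redis integration we deviate from looks roughly like the following. This is a minimal sketch, assuming the spark-redis connector is on the classpath; the host, table name, and key column are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-redis-example")
  .config("spark.redis.host", "localhost") // assumed Redis endpoint
  .config("spark.redis.port", "6379")
  .getOrCreate()

// Read a Redis "table" (a set of hashes) back as a DataFrame via spark-redis.
val profiles = spark.read
  .format("org.apache.spark.sql.redis")
  .option("table", "profiles")        // illustrative table name
  .option("key.column", "profileId")  // illustrative key column
  .load()

profiles.show(5)
```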
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query it N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
5. Run as many queries as possible in parallel on top of a denormalized DataFrame
• Query 1: foo = 1
• Query 2: bar.baz > 120
• Query 3: state in [CA, NY]
• … Query 1000

ProfileIds   field1   field1000   eventsArray
a@a.com      a        x           [e1,2,3]
b@g.com      b        x           [e1]
d@d.com      d        y           [e1,2,3]
z@z.com      z        y           [e1,2,3,5,7]
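A minimal sketch of this pattern in Scala, assuming the denormalized DataFrame (here called profiles) is already loaded and cached in a long-running driver; the predicates and thread-pool size are illustrative:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.sql.DataFrame

// Each "query" is just a filter + count over the same cached DataFrame,
// fired from its own thread so Spark can run the resulting jobs concurrently.
def runQueriesInParallel(profiles: DataFrame, predicates: Seq[String])
                        (implicit ec: ExecutionContext): Seq[Future[(String, Long)]] =
  predicates.map { predicate =>
    Future {
      (predicate, profiles.filter(predicate).count())
    }
  }

implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

val queries = Seq("foo = 1", "bar.baz > 120", "state IN ('CA', 'NY')")
// val results = runQueriesInParallel(profiles, queries)  // profiles: the cached DataFrame
```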
6. What do we need?
• Long Running Spark Batch Job
• Dispatch New Jobs by polling a Redis Queue
• We want to parametrize a Spark Action repeatedly for
interactive results
• E.g. submit custom queries on top of a table
• We load the data once and query it N times
• Bringing up a Spark Cluster per job has a latency cost
• Wasted time doing the same initialization actions multiple times
• Possible multi-tenancy
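The "load once, query N times" part is simply caching the denormalized DataFrame in the long-lived driver. A minimal sketch, reusing the SparkSession from above; the input path is illustrative:

```scala
// Done once when the long-running driver starts; every subsequent query
// reuses the cached, denormalized DataFrame instead of re-reading the source.
val profiles = spark.read.parquet("/data/profiles_denormalized") // illustrative path
profiles.cache()
profiles.count() // force materialization so later queries hit the cache
```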
8. Why not Structured Streaming?
• Lack of access to the SparkContext from within the executor context
• Can't run a Spark action on top of a DataFrame that is already loaded in the driver unless you do a join
• Doing such a join is extremely limited
9. Working Solution Summary
• Blocking POP on Redis inside the driver; use the Command pattern to send queries to the Redis queue (sketched below)
• Consume the commands and trigger Spark actions using a FAIR scheduler
• Communicate job status through a microservice/database, or Redis itself!
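A minimal sketch of that driver loop with Jedis, assuming spark.scheduler.mode=FAIR with a pool named "queries" defined in the scheduler allocation file, and a cached profiles DataFrame; the queue name, result key, and message encoding are illustrative:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.sql.DataFrame
import redis.clients.jedis.Jedis

implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))

def pollAndRun(profiles: DataFrame): Unit = {
  val queue = new Jedis("localhost", 6379) // driver-side connection; pool in practice
  while (true) {
    // BLPOP blocks until a command arrives (timeout 0 = wait forever).
    // Each element is encoded as "<queryId>|<predicate>" (illustrative command format).
    val popped = queue.blpop(0, "query-queue")
    val Array(queryId, predicate) = popped.get(1).split("\\|", 2)

    Future {
      // One thread per query so the FAIR scheduler can interleave the Spark jobs.
      profiles.sparkSession.sparkContext
        .setLocalProperty("spark.scheduler.pool", "queries")
      val count = profiles.filter(predicate).count()
      // Report the result/status back through Redis itself.
      val status = new Jedis("localhost", 6379)
      try status.hset("query-results", queryId, count.toString)
      finally status.close()
    }
  }
}
```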
10. Session Workflow – Spark Continuous Session
[Diagram: a Query API in front of a long-running Spark driver with Executors 1…N, each holding partitions of a sample DataFrame]
• Submit path: 1. POST /preview → 2. Check if the result is already in the cache → 3. Push <query> into the Redis queue → 4. The Spark driver pops queries until the queue is empty ([q1, q2, q3 … q100]) and runs them on the executors
• Fetch path: 1. GET /preview/<previewID> → 2. Fetch counters from Redis
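On the API side, submit and fetch are just a queue push and a hash read. A minimal sketch with Jedis; the key names and message encoding match the illustrative driver loop above:

```scala
import redis.clients.jedis.Jedis

val redis = new Jedis("localhost", 6379) // illustrative host; pool in practice

// POST /preview handler: enqueue the query for the long-running driver to BLPOP.
def submitPreview(queryId: String, predicate: String): Unit =
  redis.rpush("query-queue", s"$queryId|$predicate")

// GET /preview/<previewID> handler: return whatever the driver has written so far.
def fetchPreview(queryId: String): Option[String] =
  Option(redis.hget("query-results", queryId))
```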
15. What is wrong with Accumulators?
• Repeated Task Execution - Non-idempotency
• Task Failures and Retries
• Re-using a stage in repeated operations
• Speculative Execution
• Memory pressure on the driver on collect()
• Can’t access per-partition stats programmatically, AFAIK
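A minimal sketch of the Redis-hash alternative, covering the Niche 2 points above: each partition tallies locally, writes its totals through a pipeline, and keys the hash fields by partition id so that retried or speculatively executed tasks overwrite their own field instead of double counting (one possible precaution, not the only one). The counter key, column name, and host are illustrative:

```scala
import scala.collection.JavaConverters._
import scala.collection.mutable
import org.apache.spark.TaskContext
import org.apache.spark.sql.DataFrame
import redis.clients.jedis.Jedis

// Count rows per "state" within each partition and push the totals to a Redis hash.
def countByState(df: DataFrame, counterKey: String): Unit =
  df.rdd.foreachPartition { rows =>
    val partitionId = TaskContext.getPartitionId()
    val local = mutable.Map.empty[String, Long].withDefaultValue(0L)
    rows.foreach { row => local(row.getAs[String]("state")) += 1 }

    val redis = new Jedis("localhost", 6379) // illustrative host; pool in practice
    try {
      val pipe = redis.pipelined()
      local.foreach { case (state, count) =>
        // HSET (not HINCRBY) keeps the write idempotent across retries and
        // speculative execution: a re-run task just rewrites the same field.
        pipe.hset(counterKey, s"$state:$partitionId", count.toString)
      }
      pipe.sync() // flush all writes in a single round trip (pipelining)
    } finally redis.close()
  }

// Read the counters back: sum each state's per-partition fields.
def readCounters(counterKey: String): Map[String, Long] = {
  val redis = new Jedis("localhost", 6379)
  try {
    redis.hgetAll(counterKey).asScala.toMap
      .groupBy { case (field, _) => field.split(":")(0) }
      .map { case (state, entries) => state -> entries.values.map(_.toLong).sum }
  } finally redis.close()
}
```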