1. © Tubi, proprietary and confidential
Parallelization of Structured Streaming Jobs
using Delta
- Oliver Lewis, Sr. Data Engineer
3. Datalake Throughput
• Requests/s: 40,000
• Aggregate records/day: 800M
• Volume/day: 500GB
6. Engineering Challenges w/ Stream-First Architecture
• Datalake file right-sizing
• Backfill / Data Deletion Process is a nightmare
• Multiple Streams writing to the same location
7. Delta @ Tubi
• Optimization of ingested parquet files
• Data Deletion Use cases (GDPR/CCPA)
• _spark_metadata failures in backfill operations
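The first two use cases map onto standard Delta operations. A minimal sketch, assuming an active SparkSession `spark`; the table path and `user_id` column are illustrative, not from the deck:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col

// Compact the many small ingested parquet files into right-sized ones:
spark.sql("OPTIMIZE delta.`/datalake/events`")

// GDPR/CCPA deletion: remove all rows for a user who requested erasure.
DeltaTable.forPath(spark, "/datalake/events")
  .delete(col("user_id") === "user-to-forget")
```

Because Delta versions the table transactionally, both operations are safe to run while streams are writing to the same location.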
9. Strategies to Backfill
1) Write a batch job to backfill.
2) Gracefully terminate the streaming job.
Gotcha: when converting the streaming job to a batch job, do not simply replace readStream/writeStream with read/write — flatMapGroupsWithState is implicitly converted to mapGroups, and you lose state management entirely.
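A sketch of the gotcha, assuming an active SparkSession `spark`; `Event`, `Session`, `updateState`, and the path are assumed names:

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(userId: String, ts: Long)       // assumed input schema
case class Session(userId: String, count: Long)  // assumed output

// Assumed stateful update function used by the streaming job.
def updateState(key: String, events: Iterator[Event],
                state: GroupState[Session]): Iterator[Session] = ???

import spark.implicits._

// Streaming version: state is carried across micro-batches.
val streaming = spark.readStream.format("delta").load("/datalake/events")
  .as[Event]
  .groupByKey(_.userId)
  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.NoTimeout)(updateState)

// Naive batch backfill: swapping readStream for read compiles, but
// flatMapGroupsWithState now behaves like mapGroups -- each group is
// seen exactly once with empty initial state, so cross-batch state is lost.
val batch = spark.read.format("delta").load("/datalake/events")
  .as[Event]
  .groupByKey(_.userId)
  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.NoTimeout)(updateState)
```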
10. Issues in backfilling large datasets
We set the start_date to 2016-01-01 and the end_date to 2020-05-31 and run the job as one giant backfill.
There are several problems in structuring the job like this:
1) A failure late in a long run forces the entire range to be retried.
2) The state store cannot hold that much state at once.
11. Encapsulate the Task.
What we need is a small batch that can be
● TRIGGERED
● EXECUTED
● COMPLETED
so that at any given time we do not store too much state on the executors and can clearly track completion.
At scale, any date can be sent as input and we should generate the same output every time, i.e. the task should be idempotent.
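One way to encapsulate this — a sketch, assuming an active SparkSession; `businessLogic` and the paths are assumed names — is to make the unit of work a single day, keyed only by its date, so re-running any date rewrites just that day's partition:

```scala
import java.time.LocalDate
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Assumed pure transformation holding the business logic.
def businessLogic(df: DataFrame): DataFrame = ???

// An idempotent task: same date in, same output partition out.
def backfillOneDay(spark: SparkSession, date: LocalDate): Unit = {
  spark.read.format("delta").load("/datalake/raw")
    .where(col("date") === date.toString)      // small, bounded input -> small state
    .transform(businessLogic)                  // no side effects
    .write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", s"date = '$date'") // replace only this day's partition
    .save("/datalake/clean")
}
```

Delta's `replaceWhere` option is what makes the overwrite idempotent: a retry of the same date atomically replaces the same partition instead of appending duplicates.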
12. Performance
To make the backfill go faster, our immediate intuition is to increase the size of the cluster.
Example: 3886 tasks on 64 cores took 8.2 mins.
With 3886 cores we could complete this job in ~8 secs.
So our intuition is CORRECT.
Increasing the cluster size is useful only while the number of cores is less than or equal to the number of tasks; beyond that, runtime is bounded by the most expensive task.
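The arithmetic behind the example, as a sketch:

```scala
// Runtime scales with the number of task "waves": ceil(tasks / cores).
val tasks = 3886
val waves64 = math.ceil(tasks / 64.0)  // 61 waves on 64 cores
val perWaveSec = 8.2 * 60 / waves64    // ~8.1 s per wave

// With one core per task there is a single wave:
val idealSec = perWaveSec * 1          // ~8 s, matching the estimate
```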
13. Performance
● But if the number of cores is greater than the number of tasks, you have a large cluster that is not being fully utilized.
● This is an important limitation of our initial intuition: spinning up a larger cluster does not always increase performance.
16. Backfilling in parallel
1) Separate the business logic from the execution logic.
2) Run multiple streams in parallel. Each job is submitted to the Spark scheduler, which is responsible for executing it depending on the number of free cores available.
3) Use Scala parallel collections (.par).
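A sketch of step 3; `runBackfillForDate` is an assumed helper wrapping one day's backfill as a Spark batch job:

```scala
import java.time.LocalDate

// Hypothetical helper wrapping one day's backfill.
def runBackfillForDate(date: LocalDate): Unit = ???

val start = LocalDate.of(2016, 1, 1)
val end   = LocalDate.of(2020, 5, 31)
val days  = Iterator.iterate(start)(_.plusDays(1)).takeWhile(!_.isAfter(end)).toVector

// .par fans the submissions out across a thread pool; each thread
// submits an independent Spark job to the scheduler.
days.par.foreach(runBackfillForDate)
```

(On Scala 2.13+, `.par` requires the separate scala-parallel-collections module.)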
17. Par collections limitations
1) Rob Pike: concurrency is not parallelism.
2) Par collections launch Spark jobs concurrently, but the Spark scheduler may not actually execute those jobs in parallel.
18. Futures and Fair Scheduler Pool
1) With the fair scheduler, each pool gets an equal share of the cluster by default, but inside each pool jobs run in FIFO order.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
2) You can further configure each pool's schedulingMode, minShare, and weight.
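Putting the two together — a sketch assuming `spark.scheduler.mode=FAIR`, an active SparkSession `spark`, and assumed `dates`/`runBackfillForDate` names. Each Future sets its own pool before submitting work, because `spark.scheduler.pool` is a thread-local property:

```scala
import java.time.LocalDate
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

val dates: Seq[LocalDate] = ???                    // days to backfill
def runBackfillForDate(d: LocalDate): Unit = ???   // one day's batch job

// Bound concurrency explicitly instead of relying on the default pool.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

val jobs = dates.map { day =>
  Future {
    // Thread-local property: each Future routes its Spark jobs
    // to its own fair-scheduler pool.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", s"backfill-$day")
    runBackfillForDate(day)
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)
```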
19. DEMO
https://dbc-5f2acc18-b29d.cloud.databricks.com/?o=6211501775943445#notebook/3133024054071987/
21. Failure and Recovery handling
● We should be able to handle failures and retries within the job.
● Build a simple StateStore to track which state each job is currently in.
● If a job has successfully finished, we can remove it from the state store.
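A minimal in-memory sketch of such a StateStore; all names here are illustrative, and a production version would persist state to a Delta table or database:

```scala
import java.time.LocalDate
import scala.collection.concurrent.TrieMap

sealed trait JobState
case object Triggered extends JobState
case object Executing extends JobState
case class Failed(attempts: Int) extends JobState

class StateStore {
  private val states = TrieMap.empty[LocalDate, JobState]

  // Record the state a day's job is currently in.
  def transition(day: LocalDate, state: JobState): Unit =
    states.update(day, state)

  // Successfully finished jobs are simply dropped from the store.
  def complete(day: LocalDate): Unit = states.remove(day)

  // Failed jobs below the retry limit are candidates for re-running.
  def pendingRetries(maxAttempts: Int): Seq[LocalDate] =
    states.collect { case (day, Failed(n)) if n < maxAttempts => day }.toSeq
}
```

A thread-safe map (TrieMap) is used so the concurrently running Futures can update their states without extra locking.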
22. Thank You.
https://corporate.tubitv.com/company/careers/
Blog: https://code.tubitv.com/
Contact:
https://www.linkedin.com/in/oliveralewis/
olewis@tubi.tv