Serverless is great for web applications and APIs, but that does not mean it cannot be used successfully for other use cases. In this talk, we will discuss a successful application of serverless in the field of High Performance Computing. Specifically, we will discuss how Lambda, Fargate, Kinesis and other serverless technologies are being used to run sophisticated financial models at one of the major reinsurance companies in the world. We will learn about the architecture, the tradeoffs, some challenges and some unresolved pain points. Most importantly, we'll find out if serverless can be a great fit for HPC and if we can finally stop managing those boring EC2 instances!
7. Agenda
● The 6 Rs of Cloud Migration
● A serverless case study
○ The problem space and types of workflows
○ Original on premise implementation
○ The PoC
○ The final production version
○ The components of a serverless job scheduler
○ Challenges & Limits
fth.link/cd22
@loige
#CLOUDDAY2022
9. A case study
Case study on AWS blog:
fth.link/awshpc
10. The workloads - Risk Rollup
🏦 Financial modeling to understand the portfolio of risk
🧠 Internal, custom-built risk model on all reinsurance deals
⚙ HPC (High-Performance Computing) workload
🗄 ~45TB data processed
⏱ 2-3 rollups per day (6-8 hours each!)
11. The workloads - Deal Analytics
⚡ Near real-time deal pricing using the same risk model
🗃 Lower data volumes
🔁 High frequency of execution – up to 1,000 per day
13. Challenges
🐢 Long execution times, constraining business agility
🥊 Competing workloads
📈 Limits our ability to support portfolio growth
😩 Can’t deliver new features
🧾 Very high total cost of ownership
14. Thinking Big
💭 Imagine a solution that would …
1. Offer a dramatic increase in performance
2. Provide consistent run times
3. Support more executions, more often
4. Support future portfolio growth and new capabilities – 15x data volumes
15. The Goal ⚽
Run a Risk Rollup in 1 hour!
16. Architecture Options for Compute/Orchestration
AWS Lambda
Amazon SQS
AWS Step Functions
AWS Fargate
Reduce the problem to simple, scalable, event-driven components
18. Measure Everything! 📏
⏱ Built metrics in from the start
AWS metrics we wish existed out of the box:
- Number of running containers
- Success/failure counts
🎨 Custom metrics:
- Scheduler overhead
- Detailed timings (job duration, I/O time, algorithm steps)
🛠 Using CloudWatch, EMF
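A custom metric such as job duration can be published through the CloudWatch Embedded Metric Format (EMF) by writing a structured JSON log line. The following is a minimal sketch, not the project's actual code: the namespace `RiskRollup`, the dimension `JobType` and the metric names are hypothetical, chosen only for illustration.

```python
import json
import time


def emf_record(namespace, dimensions, metrics, timestamp_ms=None):
    """Build one EMF log record.

    dimensions: dict of dimension name -> value
    metrics:    dict of metric name -> (value, unit)
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set containing every dimension name.
                    "Dimensions": [list(dimensions)],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_, unit) in metrics.items()
                    ],
                }
            ],
        }
    }
    # Dimension and metric values live at the top level of the record.
    record.update(dimensions)
    record.update({name: value for name, (value, _) in metrics.items()})
    return record


def emit(record):
    # In Lambda, a JSON line on stdout ends up in CloudWatch Logs,
    # where EMF records are turned into metrics.
    print(json.dumps(record))


if __name__ == "__main__":
    emit(
        emf_record(
            "RiskRollup",  # hypothetical namespace
            {"JobType": "rollup"},  # hypothetical dimension
            {
                "JobDurationMs": (1250, "Milliseconds"),
                "SchedulerOverheadMs": (40, "Milliseconds"),
            },
        )
    )
```

Because the metric travels inside a log line, the worker does not need to make a separate metrics API call on its hot path.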
19. Measure Everything! 📏
👍 Rollup in 1 hour
☁ Running on AWS Batch
👎 Cluster utilisation was <50%
✅ Goal success
🤔 Understanding of what needs to be addressed next!
23. Horizontal scaling 🚀
1000s of jobs
Duration: 1 second – 45 minutes
Scaling horizontally = splitting jobs
Jobs split according to their complexity/duration
Resulting in >1 million jobs
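Splitting jobs by expected duration could look roughly like the sketch below. It assumes, purely for illustration, that each job consists of independent work items and carries a duration estimate; the `Job` fields and the 60-second target are hypothetical and not taken from the talk.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: str
    items: int          # hypothetical: number of independent work items
    est_seconds: float  # hypothetical: estimated total duration


def split_job(job, target_seconds=60.0):
    """Split a job into item ranges so that each part is expected to
    take roughly target_seconds or less.

    Returns a list of (sub_job_id, start, end) with end exclusive.
    """
    parts = max(1, math.ceil(job.est_seconds / target_seconds))
    parts = min(parts, max(1, job.items))  # never more parts than items
    base, extra = divmod(job.items, parts)
    ranges, start = [], 0
    for i in range(parts):
        # The first `extra` parts take one additional item each.
        size = base + (1 if i < extra else 0)
        ranges.append((f"{job.job_id}/{i}", start, start + size))
        start += size
    return ranges
```

Short jobs stay whole, while a 45-minute job is cut into many sub-jobs that can run in parallel, which is how a few thousand jobs can grow into a very large number of small ones.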
28. Compute Services
AWS Fargate
Scales to 1000s of tasks (containers)
Little management overhead
Up to 4 vCPUs and 30GB Memory
Up to 200GB ephemeral storage
AWS Lambda
Scales to 1000s of function containers (in seconds!)
Very little management overhead
Up to 6 vCPUs and 10GB Memory
Up to 10GB ephemeral storage
It wasn’t always this way!
29. Store all the things in S3!
The source of truth for:
● Input Data (JSON, Parquet)
● Intermediate Data (Parquet)
● Results (Parquet)
● Aggregates (Parquet)
Input data: 20GB
Output data: ~1 TB
Reads and writes: 10,000s of objects per second.
30. Scheduling and Orchestration
✅ We have our cluster (Fargate or Lambda)
✅ We have a plan! (list of jobs, parameters and dependencies)
🤔 How do we feed this plan to the cluster?!
🤨 Existing schedulers use traditional clusters – there is no serverless job scheduler for workloads like this!
31. Lifecycle of a Job
1. A new job gets queued
2. A worker picks up the job and executes it
3. The worker emits the job state (success or failure)
32. Event-Driven Scheduler
Job states are pulled from a Kinesis Data Stream
Redis stores:
- Job states
- Dependencies
This scheduler checks new job states against the state in Redis and figures out if there are new jobs that can be scheduled next
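The core of such a scheduler is dependency resolution driven by job-state events. The sketch below illustrates the idea only: plain Python dictionaries stand in for Redis, events are passed in directly instead of being read from Kinesis, and the class, method and state names are hypothetical.

```python
from collections import defaultdict


class Scheduler:
    """Toy event-driven scheduler; dicts stand in for Redis."""

    def __init__(self, dependencies):
        # dependencies: job id -> set of job ids it must wait for.
        # Assumes every job appears as a key.
        self.pending = {job: set(deps) for job, deps in dependencies.items()}
        self.dependents = defaultdict(set)
        for job, deps in dependencies.items():
            for dep in deps:
                self.dependents[dep].add(job)
        self.state = {job: "PENDING" for job in dependencies}

    def initial_jobs(self):
        """Jobs with no dependencies can be queued immediately."""
        ready = sorted(j for j, deps in self.pending.items() if not deps)
        for job in ready:
            self.state[job] = "QUEUED"
        return ready

    def on_job_state(self, job, status):
        """Handle one job-state event and return newly runnable jobs."""
        if self.state.get(job) in ("SUCCEEDED", "FAILED"):
            return []  # duplicate event: a stream may redeliver a record
        self.state[job] = status
        if status != "SUCCEEDED":
            return []  # dependents of a failed job stay pending
        ready = []
        for child in sorted(self.dependents[job]):
            self.pending[child].discard(job)
            if not self.pending[child] and self.state[child] == "PENDING":
                self.state[child] = "QUEUED"
                ready.append(child)
        return ready


if __name__ == "__main__":
    plan = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}
    scheduler = Scheduler(plan)
    print(scheduler.initial_jobs())
    print(scheduler.on_job_state("a", "SUCCEEDED"))
    print(scheduler.on_job_state("b", "SUCCEEDED"))
```

Ignoring events for jobs that already reached a final state keeps the handler idempotent, which matters when the same record can be processed more than once.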
34. Outcomes 🙌
Business
● Rollup in 1 hour
● Removed limits on number of runs
● Faster, more consistent deal analytics
● Business spending more time on revenue-generating activities
● Support portfolio growth and deliver new capabilities
Technology
● Brought serverless to HPC financial modeling
● Reduced codebase by ~70%
● Lowered total cost of ownership
● Increased dev team agility
● Reduced carbon footprint
37. S3 Partitioning
S3 cleverly detects high-throughput prefixes and creates partitions… normally
If this does not happen…
🚨 Please reduce your request rate; Status Code: 503; Error Code: SlowDown
38. The Solution
Explicit Partitioning:
○ Figure out how many partitions you need
○ Update code to create keys uniformly distributed over all partitions
/part/0…
/part/1…
/part/2…
/part/3…
…
/part/f…
1. Talk (a lot) to AWS SAs, Support, Account Manager for special requirements like this!
2. Think ahead if you have multiple accounts for different environments!
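The two explicit-partitioning steps can be sketched as follows. The request rates per prefix are the baseline figures from the S3 performance documentation (3,500 writes and 5,500 reads per second per prefix). The key layout after the `part/<hex>/` prefix and both function names are assumptions made for this illustration, since the slide does not show the full keys.

```python
import hashlib
import math

# Documented S3 baseline request rates per partitioned prefix.
WRITES_PER_PREFIX = 3500  # PUT/COPY/POST/DELETE per second
READS_PER_PREFIX = 5500   # GET/HEAD per second


def partitions_needed(writes_per_second, reads_per_second):
    """Estimate how many prefixes are needed for the target request rates."""
    return max(
        1,
        math.ceil(writes_per_second / WRITES_PER_PREFIX),
        math.ceil(reads_per_second / READS_PER_PREFIX),
    )


def partitioned_key(logical_key, partitions=16):
    """Map a logical key to a key under one of `partitions` prefixes.

    A hash of the logical key picks the partition, so keys spread
    uniformly and the same logical key always maps to the same prefix.
    The layout part/<hex index>/<logical key> is hypothetical.
    """
    digest = hashlib.sha256(logical_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % partitions
    return f"part/{index:x}/{logical_key}"


if __name__ == "__main__":
    print(partitions_needed(20000, 10000))
    print(partitioned_key("results/job-1.parquet"))
```

With 16 partitions the prefixes run from `part/0` to `part/f`, matching the list on the slide. Readers compute the same hash to find an object, so no lookup table is required.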
39. Fargate Scaling
● We want to run 3000 containers ASAP
● This took > 1 hour!
● We built a custom Fargate scaler
○ Using the RunTask API (no ECS Service)
○ Hidden quota increases
○ Step Function + Lambda
● 3000 containers in ~20 minutes
The AWS ECS team has since made lots of improvements, making it possible to scale to 3,000 containers in under 5 minutes
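Launching containers through the RunTask API means issuing many calls, because a single RunTask request starts at most 10 tasks. The sketch below shows only that batching idea and is not the custom scaler described above: `launch_tasks` is a hypothetical helper, and the `run_task` argument is any callable with the shape of the ECS RunTask call, so the example runs without AWS access.

```python
def launch_tasks(run_task, total, batch_size=10):
    """Start `total` tasks through repeated RunTask-style calls.

    run_task: callable accepting count=<int> and returning a dict with
              "tasks" and "failures" lists, like the ECS RunTask response.
    batch_size: RunTask accepts at most 10 tasks per call.
    """
    started, failures = [], []
    remaining = total
    while remaining > 0:
        count = min(batch_size, remaining)
        response = run_task(count=count)
        started.extend(task["taskArn"] for task in response.get("tasks", []))
        failures.extend(response.get("failures", []))
        remaining -= count
    return started, failures


if __name__ == "__main__":
    # Stand-in for the real API call, for demonstration only.
    def fake_run_task(count):
        return {
            "tasks": [{"taskArn": f"arn:fake:task/{i}"} for i in range(count)],
            "failures": [],
        }

    arns, failed = launch_tasks(fake_run_task, 25)
    print(len(arns), len(failed))
```

With boto3, the callable could be built with `functools.partial` around the ECS client's `run_task` method, pre-filling the cluster, task definition, launch type and network configuration. A production scaler would also need to handle API throttling and retry failed launches, which is presumably where the Step Function and Lambda come in.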
40. How high can we go today?
🚀 10,000 concurrent Lambda functions in seconds
🎢 10,000 Fargate containers in 10 minutes
💸 No additional cost
vladionescu.me/posts/scaling-containers-on-aws-in-2022
41. Wrapping up 🎁
● "Serverless supercomputer" lets you do HPC with commodity AWS compute
● Plenty of challenges, but it's doable!
● Agility and innovation benefits are massive
● Customer is now serverless-first and expert in AWS
Other interesting case studies:
☁ AWS HTC Grid
🧬 COVID genome research
42. Special thanks to @eoins and @cmthorne10
fth.link/cd22