Serverless for High Performance Computing

Serverless for HPC
Luciano Mammino
fourTheorem
@loige


👋 Hello, I am Luciano
Senior architect
nodejsdesignpatterns.com
Let’s connect:
🌎 loige.co
🐦 @loige
🎥 loige
🧳 lucianomammino

Middy Framework
SLIC Starter - Serverless Accelerator
SLIC Watch - Observability Plugin
Business focused technologists.
Accelerated Serverless | AI as a Service | Platform Modernisation

We host a podcast about AWS and Cloud computing
🔗 awsbites.com
🎬 YouTube Channel
🎙 Podcast
📅 Episodes every week
@loige
#CLOUDDAY2022

Get the slides: fth.link/cd22

Agenda
● The 6 Rs of Cloud Migration
● A serverless case study
○ The problem space and types of workflows
○ Original on-prem implementation
○ The PoC
○ The final production version
○ The components of a serverless job scheduler
○ Challenges & Limits

The 6 Rs of Cloud Migrations
🗑 Retire
🕸 Retain
🚚 Rehost
🏗 Replatform
📐 Refactor
💰 Repurchase

A case study
Case study on AWS blog:
fth.link/awshpc

The workloads - Risk Rollup
🏦 Financial modeling to understand the portfolio of risk
🧠 Internal, custom-built risk model on all reinsurance deals
⚙ HPC (High-Performance Computing) workload
🗄 ~45TB data processed
⏱ 2-3 rollups per day (6-8 hours each!)

The workloads - Deal Analytics
⚡ Near real-time deal pricing using the same risk model
🗃 Lower data volumes
🔁 High frequency of execution – up to 1,000 per day

Original on-prem implementation

Challenges
🐢 Long execution times, constraining business agility
🥊 Competing workloads
📈 Limits our ability to support portfolio growth
😩 Can’t deliver new features
🧾 Very high total cost of ownership

Thinking Big
💭 Imagine a solution that would …
1. Offer a dramatic increase in performance
2. Provide consistent run times
3. Support more executions, more often
4. Support future portfolio growth and new capabilities – 15x data volumes

The Goal ⚽
Run a Risk Rollup in 1 hour!

Architecture Options for Compute/Orchestration
AWS Lambda
Amazon SQS
AWS Step Functions
AWS Fargate
[Handwritten slide note, garbled in extraction; it appears to mention simple, scalable, event-driven components]

POC Architecture
AWS Batch
S3
Step Functions
Lambda
SQS

Measure Everything! 📏
⏱ Built metrics in from the start
AWS metrics we wish existed out of the box:
- Number of running containers
- Success/failure counts
🎨 Custom metrics:
- Scheduler overhead
- Detailed timings (job duration, I/O time, algorithm steps)
🛠 Using CloudWatch, EMF
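
A minimal sketch of what emitting one custom metric through the CloudWatch Embedded Metric Format (EMF) could look like. The namespace, metric name and dimension below are hypothetical and not taken from the project; the record is simply printed, on the assumption that standard output is shipped to CloudWatch Logs (as it is on Lambda).

```python
import json
import time


def emf_record(namespace, metric_name, value, unit, dimensions):
    """Build one CloudWatch Embedded Metric Format (EMF) log record."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set, made of all the dimension names
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value lives at the top level, keyed by metric name
        metric_name: value,
    }
    # Dimension values also live at the top level of the record
    record.update(dimensions)
    return record


record = emf_record(
    namespace="RiskRollup",               # hypothetical namespace
    metric_name="JobDuration",            # hypothetical metric name
    value=12.5,
    unit="Seconds",
    dimensions={"JobType": "modelling"},  # hypothetical dimension
)
print(json.dumps(record))
```

Writing metrics as structured log lines avoids a separate API call per data point, which matters when a very large number of short jobs each report their own timings.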

Measure Everything! 📏
👍 Rollup in 1 hour
☁ Running on AWS Batch
👎 Cluster utilisation was <50%
✅ Goal success
🤔 Understanding of what needs to be addressed next!

Beyond the PoC
Production: optimise for unique workload characteristics

In reality, not all jobs are alike!

Horizontal scaling 🚀
1,000s of jobs
Duration: 1 second – 45 minutes
Scaling horizontally = splitting jobs
Jobs split according to their complexity/duration
Resulting in >1 million jobs
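
One way to picture the splitting step is sketched below. It assumes each job comes with an estimated duration and that a job is cut into equal parts so that no part exceeds a target duration; the function name, the 60-second target and the job identifiers are all hypothetical, and the slides do not describe the actual splitting rule.

```python
import math


def split_job(job_id, estimated_seconds, target_seconds=60):
    """Split one job into equal sub-jobs of at most target_seconds each."""
    parts = max(1, math.ceil(estimated_seconds / target_seconds))
    return [
        {
            "id": f"{job_id}/{i}",
            "part": i,
            "of": parts,
            "estimated_seconds": estimated_seconds / parts,
        }
        for i in range(parts)
    ]


# Two hypothetical jobs at the extremes quoted above: 1 second and 45 minutes
jobs = {"deal-a": 1, "deal-b": 45 * 60}
sub_jobs = [
    sub_job
    for job_id, seconds in jobs.items()
    for sub_job in split_job(job_id, seconds)
]
print(len(sub_jobs))
```

Short jobs stay whole while long ones fan out, which is how a few thousand jobs can turn into a much larger number of small, evenly sized units of work.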

Moving to production 🚢

Actual End to End overview

Modelling Worker

Compute Services
AWS Fargate
Scales to 1,000s of tasks (containers)
Little management overhead
Up to 4 vCPUs and 30GB Memory
Up to 200GB ephemeral storage
AWS Lambda
Scales to 1,000s of function containers (in seconds!)
Very little management overhead
Up to 6 vCPUs and 10GB Memory
Up to 10GB ephemeral storage
It wasn’t always this way!

Store all the things in S3!
The source of truth for:
● Input Data (JSON, Parquet)
● Intermediate Data (Parquet)
● Results (Parquet)
● Aggregates (Parquet)
Input data: 20GB
Output data: ~1 TB
Reads and writes: 10,000s of objects per second.

Scheduling and Orchestration
✅ We have our cluster (Fargate or Lambda)
✅ We have a plan! (list of jobs, parameters and dependencies)
🤔 How do we feed this plan to the cluster?!
🤨 Existing schedulers use traditional clusters – there is no serverless job scheduler for workloads like this!

Lifecycle of a Job
1. A new job gets queued
2. A worker picks up the job and executes it
3. The worker emits the job state (success or failure)
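
The lifecycle above can be sketched as a small worker loop. This is only an illustration: an in-memory `queue.Queue` and a plain list stand in for the real job queue and state stream, and the job types and handlers are hypothetical.

```python
import queue


def run_worker(job_queue, state_stream, handlers):
    """Drain the queue: execute each job, then emit its final state."""
    while True:
        try:
            job = job_queue.get_nowait()
        except queue.Empty:
            return  # nothing left to pick up
        try:
            handlers[job["type"]](job)
            state = "success"
        except Exception:
            state = "failure"
        state_stream.append({"job_id": job["id"], "state": state})


def failing_handler(job):
    raise RuntimeError("simulated failure")


job_queue = queue.Queue()
job_queue.put({"id": "j1", "type": "ok"})
job_queue.put({"id": "j2", "type": "bad"})

state_stream = []  # stand-in for the stream of job states
run_worker(
    job_queue,
    state_stream,
    {"ok": lambda job: None, "bad": failing_handler},
)
print(state_stream)
```

The worker always emits a state, even when the job raises, so whatever consumes the states can tell a failed job apart from one that is still running.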

Event-Driven Scheduler
Job states are pulled from a Kinesis Data Stream
Redis stores:
- Job states
- Dependencies
This scheduler checks new job states against the state in Redis and figures out if there are new jobs that can be scheduled next
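
A rough sketch of the check described above, assuming each job lists the jobs it depends on and becomes schedulable once all of them have succeeded. Two plain dictionaries stand in for the job states and dependencies kept in Redis; the function name, state names and job ids are hypothetical.

```python
def on_job_state(event, states, dependencies):
    """Record a job state and return the jobs that are now ready to run.

    states:       job_id -> "pending" | "queued" | "success" | "failure"
    dependencies: job_id -> set of job_ids that must succeed first
    Both dictionaries stand in for the data the slides keep in Redis.
    """
    states[event["job_id"]] = event["state"]
    ready = []
    for job_id, deps in dependencies.items():
        if states.get(job_id, "pending") != "pending":
            continue  # already queued or finished
        if all(states.get(dep) == "success" for dep in deps):
            states[job_id] = "queued"
            ready.append(job_id)
    return ready


dependencies = {"a": set(), "b": {"a"}, "c": {"a", "b"}}
states = {"a": "queued"}

first = on_job_state({"job_id": "a", "state": "success"}, states, dependencies)
second = on_job_state({"job_id": "b", "state": "success"}, states, dependencies)
print(first, second)
```

Scanning every job on every event is only for readability. With more than a million jobs, a real scheduler would more plausibly keep a reverse index from each job to its dependents and check only those.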

Dynamic Runtime Handling
We also need to handle system failures!

Outcomes 🙌
Business
● Rollup in 1 hour
● Removed limits on number of runs
● Faster, more consistent deal analytics
● Business spending more time on revenue-generating activities
● Support portfolio growth and deliver new capabilities
Technology
● Brought serverless to HPC financial modeling
● Reduced codebase by ~70%
● Lowered total cost of ownership
● Increased dev team agility
● Reduced carbon footprint

Hitting the limits 😰

S3 Throughput

S3 Partitioning
S3 cleverly detects high-throughput prefixes and creates partitions... normally
If this does not happen…
🚨 Please reduce your request rate; Status Code: 503; Error Code: SlowDown

The Solution
Explicit Partitioning:
○ Figure out how many partitions you need
○ Update code to create keys uniformly distributed over all partitions
/part/0…
/part/1…
/part/2…
/part/3…
…
/part/f…
1. Talk (a lot) to AWS SAs, Support, Account Manager for special requirements like this!
2. Think ahead if you have multiple accounts for different environments!
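
The explicit partitioning above can be sketched as follows, assuming 16 partitions named by a single hex digit (0 to f) and a partition chosen by hashing the object name. The hash function, the key layout after the partition prefix and the example object names are assumptions, not the project's actual scheme.

```python
import hashlib


def partitioned_key(object_name, partitions=16):
    """Prefix a key with a hash-derived partition so keys spread uniformly."""
    digest = hashlib.sha256(object_name.encode("utf-8")).digest()
    partition = int.from_bytes(digest[:4], "big") % partitions
    # Hex partition label: 0..f for the default of 16 partitions
    return f"part/{partition:x}/{object_name}"


keys = [partitioned_key(f"results/job-{i}.parquet") for i in range(1600)]
used_partitions = {key.split("/")[1] for key in keys}
print(sorted(used_partitions))
```

Because the partition is derived from the object name, any reader can recompute the full key without a lookup table, and writes spread across all prefixes instead of piling onto one.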

Fargate Scaling
● We want to run 3,000 containers ASAP
● This took > 1 hour!
● We built a custom Fargate scaler
○ Using the RunTask API (no ECS Service)
○ Hidden quota increases
○ Step Function + Lambda
● 3,000 containers in ~20 minutes
The AWS ECS team has since made many improvements, making it possible to scale to 3,000 containers in under 5 minutes
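
A simplified sketch of the core loop such a scaler might run, assuming a boto3 ECS client is passed in. RunTask starts at most 10 tasks per call, so reaching thousands of containers means many calls. The function name and parameters are hypothetical, and throttling, retries and the Step Functions orchestration mentioned above are left out.

```python
def launch_tasks(ecs_client, cluster, task_definition, total, network_config):
    """Start `total` Fargate tasks via RunTask, at most 10 per API call."""
    started = 0
    while started < total:
        batch = min(10, total - started)  # RunTask accepts a count of 1-10
        response = ecs_client.run_task(
            cluster=cluster,
            taskDefinition=task_definition,
            count=batch,
            launchType="FARGATE",
            networkConfiguration=network_config,
        )
        launched = len(response.get("tasks", []))
        if launched == 0:
            # Give up instead of looping forever when nothing starts
            raise RuntimeError(f"RunTask failed: {response.get('failures')}")
        started += launched
    return started
```

With boto3 installed and credentials configured, `ecs_client` would be `boto3.client("ecs")`. A production scaler would also need to respect API rate limits, which is presumably where the quota increases come in.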

How high can we go today?
🚀 10,000 concurrent Lambda functions in seconds
🎢 10,000 Fargate containers in 10 minutes
💸 No additional cost
vladionescu.me/posts/scaling-containers-on-aws-in-2022

Wrapping up 🎁
● "Serverless supercomputer" lets you do HPC with commodity AWS compute
● Plenty of challenges, but it's doable!
● Agility and innovation benefits are massive
● Customer is now serverless-first and expert in AWS
Other interesting case studies:
☁ AWS HTC Grid
🧬 COVID genome research

Special thanks to @eoins and @cmthorne10
