SlideShare une entreprise Scribd logo
1  sur  42
Télécharger pour lire hors ligne
Serverless for HPC
Luciano Mammino
fourTheorem
@loige
Diamond Sponsor
Partner
Platinum Sponsor Gold Sponsor
👋 Hello, I am Luciano
Senior architect
nodejsdesignpatterns.com
Let’s connect:
🌎 loige.co
🐦 @loige
🎥 loige
🧳 lucianomammino
Middy Framework
SLIC Starter - Serverless Accelerator
SLIC Watch - Observability Plugin
Business focused technologists.
Accelerated Serverless | AI as a Service | Platform Modernisation
We host a podcast about AWS and Cloud computing
🔗 awsbites.com
🎬 YouTube Channel
🎙 Podcast
📅 Episodes every week
@loige
#CLOUDDAY2022
Get the slides: fth.link/cd22
@loige
#CLOUDDAY2022
Agenda
● The 6 Rs of Cloud Migration
● A serverless case study
○ The problem space and types of workflows
○ Original on premise implementation
○ The PoC
○ The final production version
○ The components of a serverless job scheduler
○ Challenges & Limits
fth.link/cd22
@loige
#CLOUDDAY2022
The 6 Rs of Cloud Migrations
🗑 🕸 🚚
Retire Retain Rehost
🏗 📐 💰
Replatform Refactor Repurchase
@loige
#CLOUDDAY2022
fth.link/cd22
A case study
Case study on AWS blog:
fth.link/awshpc
@loige
#CLOUDDAY2022
The workloads - Risk Rollup
🏦 Financial modeling to understand the portfolio of risk
🧠 Internal, custom-built risk model on all reinsurance deals
⚙ HPC (High-Performance Computing) workload
🗄 ~45TB data processed
⏱ 2/3 rollups per day (6-8 hours each!)
@loige
#CLOUDDAY2022
The workloads - Deal Analytics
⚡ Near real-time deal pricing using the same risk model
🗃 Lower data volumes
🔁 High frequency of execution – up to 1.000 per day
@loige
#CLOUDDAY2022
Original on-prem implementation
@loige
#CLOUDDAY2022
Challenges
🐢 Long execution times, constraining business agility
🥊 Competing workloads
📈 Limits our ability to support portfolio growth
😩 Can’t deliver new features
🧾 Very high total cost of ownership
@loige
#CLOUDDAY2022
Thinking Big
💭 Imagine a solution that would …
1. Offer a dramatic increase in performance
2. Provide consistent run times
3. Support more executions, more often
4. Support future portfolio growth and new
capabilities – 15x data volumes
@loige
#CLOUDDAY2022
The Goal ⚽
Run a Risk Rollup in 1 hour!
@loige
#CLOUDDAY2022
Architecture Options for Compute/Orchestration
AWS Lambda
Amazon SQS AWS Step Functions
AWS Fargate
Com t om :
Red he b to si l ,
s a l , ev -d i n
co n s
@loige
#CLOUDDAY2022
POC Architecture
AWS Batch
S3
Step Functions
Lambda
SQS
@loige
#CLOUDDAY2022
Measure Everything! 📏
⏱ Built metrics in from the start
󰤈 AWS metrics we wish existed out of the box:
- Number of running containers
- Success/failure counts
🎨 Custom metrics:
- Scheduler overhead
- Detailed timings (job duration, I/O time, algorithm steps)
🛠 Using CloudWatch, EMF
@loige
#CLOUDDAY2022
Measure Everything! 📏
👍 Rollup in 1 hour
☁ Running on AWS Batch
👎 Cluster utilisation was <50%
✅ Goal success
🤔 Understanding of what needs to
be addressed next!
@loige
#CLOUDDAY2022
Beyond the PoC
Production: optimise for unique workload characteristics
@loige
#CLOUDDAY2022
Job Plan
@loige
#CLOUDDAY2022
In reality, not all jobs are alike!
@loige
#CLOUDDAY2022
Horizontal scaling 🚀
1000’s of jobs
Duration: 1 second – 45 minutes
Scaling horizontally = splitting jobs
Jobs split according to their
complexity/duration
Resulting in >1 million jobs
@loige
#CLOUDDAY2022
Moving to production 🚢
@loige
#CLOUDDAY2022
Scope
@loige
#CLOUDDAY2022
Actual End to End overview
@loige
#CLOUDDAY2022
Modelling Worker
@loige
#CLOUDDAY2022
Compute Services
Scales to 1000’s of tasks (containers)
Little management overhead
Up to 4 vCPUs and 30GB Memory
Up to 200GB ephemeral storage
Scales to 1000’s of function containers (in seconds!)
Very little management overhead
Up to 6 vCPUs and 10GB Memory
Up to 10GB ephemeral storage
It wasn’t always this way!
@loige
#CLOUDDAY2022
Store all the things in S3!
The source of truth for:
● Input Data (JSON, Parquet)
● Intermediate Data (Parquet)
● Results (Parquet)
● Aggregates (Parquet)
Input data: 20GB
Output data: ~1 TB
Reads and writes: 10,000s of objects per second.
@loige
#CLOUDDAY2022
Scheduling and Orchestration
✅ We have our cluster (Fargate or Lambda)
✅ We have a plan! (list of jobs, parameters and
dependencies)
🤔 How do we feed this plan to the cluster?!
🤨 Existing schedulers use traditional clusters – there
is no serverless job scheduler for workloads like this!
@loige
#CLOUDDAY2022
Lifecycle of a Job
A new job
get queued
here 👇
A worker
picks up the
job and
executes it
The worker
emits the
job state
(success or
failure)
@loige
#CLOUDDAY2022
Event-Driven Scheduler
Job states are pulled
from a Kinesis Data
Stream
Redis stores:
- Job states
- Dependencies
This scheduler checks
new job states against
the state in Redis and
figures out if there are
new jobs that can be
scheduled next
@loige
#CLOUDDAY2022
Dynamic Runtime
Handling
We also need to handle
system failures!
@loige
#CLOUDDAY2022
Outcomes 🙌
Business
● Rollup in 1 hour
● Removed limits on number of runs
● Faster, more consistent deal analytics
● Business spending more time on
revenue-generating activities
● Support portfolio growth and deliver new
capabilities
Technology
● Brought serverless to HPC financial
modeling
● Reduced codebase by ~70%
● Lowered total cost of ownership
● Increased dev team agility
● Reduced carbon footprint
@loige
#CLOUDDAY2022
Hitting the limits 😰
@loige
#CLOUDDAY2022
S3 Throughput
@loige
#CLOUDDAY2022
S3 Partitioning
S3 cleverly detects high-throughput prefixes and creates partitions
….normally
If this does not happen…
🚨Please reduce your request rate;
Status Code: 503; Error Code: SlowDown
@loige
#CLOUDDAY2022
The Solution
Explicit Partitioning:
○Figure out how many partitions you need
○Update code to create keys uniformly distributed over all partitions
/part/0…
/part/1…
/part/2…
/part/3…
…
/part/f…
1. Talk (a lot) to AWS SAs, Support, Account
Manager for special requirements like this!
2. Think ahead if you have multiple accounts
for different environments!
@loige
#CLOUDDAY2022
Fargate Scaling
●We want to run 3000 containers ASAP
●This took > 1 hour!
●We built a custom Fargate scaler
○Using the RunTask API (no ECS Service)
○Hidden quota increases
○Step Function + Lambda
●3000 containers in ~20 minutes
The AWS ECS team since made lots of
improvements, making it possible to scale to
3,000 containers in under 5 minutes
@loige
#CLOUDDAY2022
How high can we go today?
🚀 10,000 concurrent Lambda functions in seconds
🎢 10,000 Fargate containers in 10 minutes
💸 No additional cost
vladionescu.me/posts/scaling-containers-on-aws-in-2022
@loige
#CLOUDDAY2022
Wrapping up 🎁
● "Serverless supercomputer" lets you do HPC with
commodity AWS compute
● Plenty of challenges, but it's doable!
● Agility and innovation benefits are massive
● Customer is now serverless-first and expert in AWS
Other interesting case studies:
☁ AWS HTC Grid - 🧬 COVID genome research
@loige
#CLOUDDAY2022
Special thanks to @eoins and @cmthorne10
fth.link/cd22
@loige
#CLOUDDAY2022

Contenu connexe

Similaire à Serverless for High Performance Computing

Aws uk ug #8 not everything that happens in vegas stay in vegas
Aws uk ug #8   not everything that happens in vegas stay in vegasAws uk ug #8   not everything that happens in vegas stay in vegas
Aws uk ug #8 not everything that happens in vegas stay in vegas
Peter Mounce
 

Similaire à Serverless for High Performance Computing (20)

Introduction to DevOps and the Practical Use Cases at Credit OK
Introduction to DevOps and the Practical Use Cases at Credit OKIntroduction to DevOps and the Practical Use Cases at Credit OK
Introduction to DevOps and the Practical Use Cases at Credit OK
 
There is something about serverless
There is something about serverlessThere is something about serverless
There is something about serverless
 
Efficient Django
Efficient DjangoEfficient Django
Efficient Django
 
Aws uk ug #8 not everything that happens in vegas stay in vegas
Aws uk ug #8   not everything that happens in vegas stay in vegasAws uk ug #8   not everything that happens in vegas stay in vegas
Aws uk ug #8 not everything that happens in vegas stay in vegas
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
 
Fandogh Cloud workshop slides
Fandogh Cloud workshop slides Fandogh Cloud workshop slides
Fandogh Cloud workshop slides
 
Antoine Coetsier - billing the cloud
Antoine Coetsier - billing the cloudAntoine Coetsier - billing the cloud
Antoine Coetsier - billing the cloud
 
Truemotion Adventures in Containerization
Truemotion Adventures in ContainerizationTruemotion Adventures in Containerization
Truemotion Adventures in Containerization
 
Serverless? How (not) to develop, deploy and operate serverless applications.
Serverless? How (not) to develop, deploy and operate serverless applications.Serverless? How (not) to develop, deploy and operate serverless applications.
Serverless? How (not) to develop, deploy and operate serverless applications.
 
Eko10 workshop - OPEN SOURCE DATABASE MONITORING
Eko10 workshop - OPEN SOURCE DATABASE MONITORINGEko10 workshop - OPEN SOURCE DATABASE MONITORING
Eko10 workshop - OPEN SOURCE DATABASE MONITORING
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue
 
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development ModelLaskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model
 
PHP At 5000 Requests Per Second: Hootsuite’s Scaling Story
PHP At 5000 Requests Per Second: Hootsuite’s Scaling StoryPHP At 5000 Requests Per Second: Hootsuite’s Scaling Story
PHP At 5000 Requests Per Second: Hootsuite’s Scaling Story
 
Eko10 Workshop Opensource Database Auditing
Eko10  Workshop Opensource Database AuditingEko10  Workshop Opensource Database Auditing
Eko10 Workshop Opensource Database Auditing
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
Introduction to serverless computing on Google Cloud
Introduction to serverless computing on Google CloudIntroduction to serverless computing on Google Cloud
Introduction to serverless computing on Google Cloud
 
Writing and deploying serverless python applications
Writing and deploying serverless python applicationsWriting and deploying serverless python applications
Writing and deploying serverless python applications
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
PyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applicationsPyConIE 2017 Writing and deploying serverless python applications
PyConIE 2017 Writing and deploying serverless python applications
 

Plus de Luciano Mammino

Plus de Luciano Mammino (20)

Did you know JavaScript has iterators? DublinJS
Did you know JavaScript has iterators? DublinJSDid you know JavaScript has iterators? DublinJS
Did you know JavaScript has iterators? DublinJS
 
What I learned by solving 50 Advent of Code challenges in Rust - RustNation U...
What I learned by solving 50 Advent of Code challenges in Rust - RustNation U...What I learned by solving 50 Advent of Code challenges in Rust - RustNation U...
What I learned by solving 50 Advent of Code challenges in Rust - RustNation U...
 
Building an invite-only microsite with Next.js & Airtable - ReactJS Milano
Building an invite-only microsite with Next.js & Airtable - ReactJS MilanoBuilding an invite-only microsite with Next.js & Airtable - ReactJS Milano
Building an invite-only microsite with Next.js & Airtable - ReactJS Milano
 
From Node.js to Design Patterns - BuildPiper
From Node.js to Design Patterns - BuildPiperFrom Node.js to Design Patterns - BuildPiper
From Node.js to Design Patterns - BuildPiper
 
Let's build a 0-cost invite-only website with Next.js and Airtable!
Let's build a 0-cost invite-only website with Next.js and Airtable!Let's build a 0-cost invite-only website with Next.js and Airtable!
Let's build a 0-cost invite-only website with Next.js and Airtable!
 
Everything I know about S3 pre-signed URLs
Everything I know about S3 pre-signed URLsEverything I know about S3 pre-signed URLs
Everything I know about S3 pre-signed URLs
 
JavaScript Iteration Protocols - Workshop NodeConf EU 2022
JavaScript Iteration Protocols - Workshop NodeConf EU 2022JavaScript Iteration Protocols - Workshop NodeConf EU 2022
JavaScript Iteration Protocols - Workshop NodeConf EU 2022
 
Building an invite-only microsite with Next.js & Airtable
Building an invite-only microsite with Next.js & AirtableBuilding an invite-only microsite with Next.js & Airtable
Building an invite-only microsite with Next.js & Airtable
 
Let's take the monolith to the cloud 🚀
Let's take the monolith to the cloud 🚀Let's take the monolith to the cloud 🚀
Let's take the monolith to the cloud 🚀
 
A look inside the European Covid Green Certificate - Rust Dublin
A look inside the European Covid Green Certificate - Rust DublinA look inside the European Covid Green Certificate - Rust Dublin
A look inside the European Covid Green Certificate - Rust Dublin
 
Monoliths to the cloud!
Monoliths to the cloud!Monoliths to the cloud!
Monoliths to the cloud!
 
The senior dev
The senior devThe senior dev
The senior dev
 
Node.js: scalability tips - Azure Dev Community Vijayawada
Node.js: scalability tips - Azure Dev Community VijayawadaNode.js: scalability tips - Azure Dev Community Vijayawada
Node.js: scalability tips - Azure Dev Community Vijayawada
 
A look inside the European Covid Green Certificate (Codemotion 2021)
A look inside the European Covid Green Certificate (Codemotion 2021)A look inside the European Covid Green Certificate (Codemotion 2021)
A look inside the European Covid Green Certificate (Codemotion 2021)
 
AWS Observability Made Simple
AWS Observability Made SimpleAWS Observability Made Simple
AWS Observability Made Simple
 
Semplificare l'observability per progetti Serverless
Semplificare l'observability per progetti ServerlessSemplificare l'observability per progetti Serverless
Semplificare l'observability per progetti Serverless
 
Finding a lost song with Node.js and async iterators - NodeConf Remote 2021
Finding a lost song with Node.js and async iterators - NodeConf Remote 2021Finding a lost song with Node.js and async iterators - NodeConf Remote 2021
Finding a lost song with Node.js and async iterators - NodeConf Remote 2021
 
Finding a lost song with Node.js and async iterators - EnterJS 2021
Finding a lost song with Node.js and async iterators - EnterJS 2021Finding a lost song with Node.js and async iterators - EnterJS 2021
Finding a lost song with Node.js and async iterators - EnterJS 2021
 
How to send gzipped requests with boto3
How to send gzipped requests with boto3How to send gzipped requests with boto3
How to send gzipped requests with boto3
 
Finding a lost song with Node.js and async iterators
Finding a lost song with Node.js and async iteratorsFinding a lost song with Node.js and async iterators
Finding a lost song with Node.js and async iterators
 

Dernier

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Dernier (20)

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Serverless for High Performance Computing

  • 1. Serverless for HPC Luciano Mammino fourTheorem @loige
  • 3. 👋 Hello, I am Luciano Senior architect nodejsdesignpatterns.com Let’s connect: 🌎 loige.co 🐦 @loige 🎥 loige 🧳 lucianomammino
  • 4. Middy Framework SLIC Starter - Serverless Accelerator SLIC Watch - Observability Plugin Business focused technologists. Accelerated Serverless | AI as a Service | Platform Modernisation
  • 5. We host a podcast about AWS and Cloud computing 🔗 awsbites.com 🎬 YouTube Channel 🎙 Podcast 📅 Episodes every week @loige #CLOUDDAY2022
  • 6. Get the slides: fth.link/cd22 @loige #CLOUDDAY2022
  • 7. Agenda ● The 6 Rs of Cloud Migration ● A serverless case study ○ The problem space and types of workflows ○ Original on premise implementation ○ The PoC ○ The final production version ○ The components of a serverless job scheduler ○ Challenges & Limits fth.link/cd22 @loige #CLOUDDAY2022
  • 8. The 6 Rs of Cloud Migrations 🗑 🕸 🚚 Retire Retain Rehost 🏗 📐 💰 Replatform Refactor Repurchase @loige #CLOUDDAY2022 fth.link/cd22
  • 9. A case study Case study on AWS blog: fth.link/awshpc @loige #CLOUDDAY2022
  • 10. The workloads - Risk Rollup 🏦 Financial modeling to understand the portfolio of risk 🧠 Internal, custom-built risk model on all reinsurance deals ⚙ HPC (High-Performance Computing) workload 🗄 ~45TB data processed ⏱ 2/3 rollups per day (6-8 hours each!) @loige #CLOUDDAY2022
  • 11. The workloads - Deal Analytics ⚡ Near real-time deal pricing using the same risk model 🗃 Lower data volumes 🔁 High frequency of execution – up to 1.000 per day @loige #CLOUDDAY2022
  • 13. Challenges 🐢 Long execution times, constraining business agility 🥊 Competing workloads 📈 Limits our ability to support portfolio growth 😩 Can’t deliver new features 🧾 Very high total cost of ownership @loige #CLOUDDAY2022
  • 14. Thinking Big 💭 Imagine a solution that would … 1. Offer a dramatic increase in performance 2. Provide consistent run times 3. Support more executions, more often 4. Support future portfolio growth and new capabilities – 15x data volumes @loige #CLOUDDAY2022
  • 15. The Goal ⚽ Run a Risk Rollup in 1 hour! @loige #CLOUDDAY2022
  • 16. Architecture Options for Compute/Orchestration AWS Lambda Amazon SQS AWS Step Functions AWS Fargate Com t om : Red he b to si l , s a l , ev -d i n co n s @loige #CLOUDDAY2022
  • 17. POC Architecture AWS Batch S3 Step Functions Lambda SQS @loige #CLOUDDAY2022
  • 18. Measure Everything! 📏 ⏱ Built metrics in from the start 󰤈 AWS metrics we wish existed out of the box: - Number of running containers - Success/failure counts 🎨 Custom metrics: - Scheduler overhead - Detailed timings (job duration, I/O time, algorithm steps) 🛠 Using CloudWatch, EMF @loige #CLOUDDAY2022
  • 19. Measure Everything! 📏 👍 Rollup in 1 hour ☁ Running on AWS Batch 👎 Cluster utilisation was <50% ✅ Goal success 🤔 Understanding of what needs to be addressed next! @loige #CLOUDDAY2022
  • 20. Beyond the PoC Production: optimise for unique workload characteristics @loige #CLOUDDAY2022
  • 22. In reality, not all jobs are alike! @loige #CLOUDDAY2022
  • 23. Horizontal scaling 🚀 1000’s of jobs Duration: 1 second – 45 minutes Scaling horizontally = splitting jobs Jobs split according to their complexity/duration Resulting in >1 million jobs @loige #CLOUDDAY2022
  • 24. Moving to production 🚢 @loige #CLOUDDAY2022
  • 26. Actual End to End overview @loige #CLOUDDAY2022
  • 28. Compute Services Scales to 1000’s of tasks (containers) Little management overhead Up to 4 vCPUs and 30GB Memory Up to 200GB ephemeral storage Scales to 1000’s of function containers (in seconds!) Very little management overhead Up to 6 vCPUs and 10GB Memory Up to 10GB ephemeral storage It wasn’t always this way! @loige #CLOUDDAY2022
  • 29. Store all the things in S3! The source of truth for: ● Input Data (JSON, Parquet) ● Intermediate Data (Parquet) ● Results (Parquet) ● Aggregates (Parquet) Input data: 20GB Output data: ~1 TB Reads and writes: 10,000s of objects per second. @loige #CLOUDDAY2022
  • 30. Scheduling and Orchestration ✅ We have our cluster (Fargate or Lambda) ✅ We have a plan! (list of jobs, parameters and dependencies) 🤔 How do we feed this plan to the cluster?! 🤨 Existing schedulers use traditional clusters – there is no serverless job scheduler for workloads like this! @loige #CLOUDDAY2022
  • 31. Lifecycle of a Job A new job get queued here 👇 A worker picks up the job and executes it The worker emits the job state (success or failure) @loige #CLOUDDAY2022
  • 32. Event-Driven Scheduler Job states are pulled from a Kinesis Data Stream Redis stores: - Job states - Dependencies This scheduler checks new job states against the state in Redis and figures out if there are new jobs that can be scheduled next @loige #CLOUDDAY2022
  • 33. Dynamic Runtime Handling We also need to handle system failures! @loige #CLOUDDAY2022
  • 34. Outcomes 🙌 Business ● Rollup in 1 hour ● Removed limits on number of runs ● Faster, more consistent deal analytics ● Business spending more time on revenue-generating activities ● Support portfolio growth and deliver new capabilities Technology ● Brought serverless to HPC financial modeling ● Reduced codebase by ~70% ● Lowered total cost of ownership ● Increased dev team agility ● Reduced carbon footprint @loige #CLOUDDAY2022
  • 35. Hitting the limits 😰 @loige #CLOUDDAY2022
  • 37. S3 Partitioning S3 cleverly detects high-throughput prefixes and creates partitions ….normally If this does not happen… 🚨Please reduce your request rate; Status Code: 503; Error Code: SlowDown @loige #CLOUDDAY2022
  • 38. The Solution Explicit Partitioning: ○Figure out how many partitions you need ○Update code to create keys uniformly distributed over all partitions /part/0… /part/1… /part/2… /part/3… … /part/f… 1. Talk (a lot) to AWS SAs, Support, Account Manager for special requirements like this! 2. Think ahead if you have multiple accounts for different environments! @loige #CLOUDDAY2022
  • 39. Fargate Scaling ●We want to run 3000 containers ASAP ●This took > 1 hour! ●We built a custom Fargate scaler ○Using the RunTask API (no ECS Service) ○Hidden quota increases ○Step Function + Lambda ●3000 containers in ~20 minutes The AWS ECS team since made lots of improvements, making it possible to scale to 3,000 containers in under 5 minutes @loige #CLOUDDAY2022
  • 40. How high can we go today? 🚀 10,000 concurrent Lambda functions in seconds 🎢 10,000 Fargate containers in 10 minutes 💸 No additional cost vladionescu.me/posts/scaling-containers-on-aws-in-2022 @loige #CLOUDDAY2022
  • 41. Wrapping up 🎁 ● "Serverless supercomputer" lets you do HPC with commodity AWS compute ● Plenty of challenges, but it's doable! ● Agility and innovation benefits are massive ● Customer is now serverless-first and expert in AWS Other interesting case studies: ☁ AWS HTC Grid - 🧬 COVID genome research @loige #CLOUDDAY2022
  • 42. Special thanks to @eoins and @cmthorne10 fth.link/cd22 @loige #CLOUDDAY2022