SlideShare une entreprise Scribd logo
1  sur  42
Batch Processing with
Amazon EC2 Container Service
Brennan Saeta, Frank Chen
Infrastructure Engineering @ Coursera
What We Do
Massive Open Online Courses
14 million
learners
120
partners
1000
courses
2.2 million
course completions
Coursera Technical Architecture
Batch Processing Enables...
Reporting
Instructor Reports
● Grade exports
● Learner demographics
● Course progress statistics
Internal Reports
● Business metrics
● Payments reconciliation
Batch Processing Enables...
Marketing
● Recommendation emails
● Batch marketing / reactivation emails
Batch Processing Enables...
Pedagogical Innovation
● Auto-graded programming assignments
● Peer-review matching & analysis
Bad Old Days of Batch Processing
Cascade
● PHP-based job runner
● Run in screen sessions
● Polled for new jobs
● Fragile and unreliable
● Forced restarts on regular basis
Bad Old Days of Batch Processing II
Saturn
● Scala-based scheduled batch process runner
o Powered by Quartz Scheduler library
● Cron-only jobs
o Cannot run jobs on demand
o Some jobs have to wake up frequently to poll
● All jobs ran on same instance (and JVM), causing interference
What Do We Want?
Reliability
● Saturn / Cascade were flaky
● Developers became frustrated with
jobs not running properly
flickr.com/rclarkeimages, cc by-nc 2.0
What Do We Want?
Easy Development
● Developing & testing
locally was difficult
● Little or no boilerplate
should be required
flickr.com/riebart, cc by 2.0
What Do We Want?
Easy Deployment
● Deployment was difficult and
non-deterministic
● “Other services have one-click
tools, why can’t your service
have that too?”
flickr.com/derelllicht, cc by-nd 2.0
What Do We Want?
High Efficiency
● Cost-conscious
● Most jobs complete < 20 minutes
o EC2 rounds costs up to full hour
● Startup time of individual instances
too long for batch processingflickr.com/koertmichiels, cc by-nc-nd 2.0
What Do We Want?
Low Ops Load
● Only one dev-ops engineer --
can’t manage everything
● Developers own their services
● Developers shouldn’t have to
actively monitor services
flickr.com/reynermedia, cc by 2.0
Alternative Technologies
Home-grown Tech
● Tried, but proved
to be unreliable
● Difficult to handle
coordination and
synchronization
● Very powerful, but
hard to
productionize
● Needs actual
DevOps team
● GCE first-class,
everything else
second-class
Amazon ECS
● Low-to-no maintenance solution
● Integrated with AWS infrastructure
● Easy to understand and program for
However…
● No scheduled tasks
● No fine-grained monitoring of tasks
● No retries / delays when cluster out of resources
● Does not integrate well with our
existing Scala APIs and tooling
Iguazú -- Batch Management for ECS
● Named for Iguazú Falls
o World’s largest waterfall
● Batch Task Scheduler
o Immediately
o Deferred (run once at X time)
o Scheduled recurring (cron-like)
● Programmatically accessible
via standard APIs / clients
flickr.com/mrpunto, cc by-nc-nd 2.0
Iguazú Semantic Guarantees
● At most once execution for all jobs
● Jobs will be provided with at least the CPU
and RAM they requested
● Scheduler may elect to skip execution of
some scheduled jobs under adverse
conditions
Iguazú Architecture
Service
s
Services
Iguazú
Admin
Iguazú
frontend
Iguazú
backend
Amazon SQS
Cassandra Worker Machines
Worker Worker
scheduler
developers
learners
ECS APIs
Iguazú Design
● Frontend + Scheduler
o Generates requests (either via API calls or internally from the
scheduler)
o Puts new requests in SQS queues
o Handles requests for status from other services
● Backend
o Attempts to run tasks via ECS API
 Failure (e.g. lack of resources) means task goes back into queue
to try again later
o Keeps track of task status and updates Cassandra
Iguazú Admin User Interface
Iguazú Admin User Interface
Developing Iguazú Tasks
class Job extends AbstractJob with StrictLogging {
override val reservedCpu = 1024
override val reservedMemory = 1024
def run(parameters: JsValue) = {
logger.info(“I am running my Job!”)
expensiveComputationHere()
}
}
Running Tasks Locally
$ sbt
> project iguazuJobs
[info] Set current project to iguazu-jobs
> run sample sample.json
[info] Running org.coursera.iguazu.internal.IguazuJobsMain
sample sample.json
[info] 2015-07-08 13:31:52,368 INFO [o.c.i.j.s.Job] >>> I am
running my job!
[success] Total time: 5 s, completed Jul 8, 2015 1:31:58 PM
>
Running Tasks from Other Services
// invoking a job with one command
// from another service via Naptime REST framework
val invocationId = IguazuJobInvocationClient
.create(IguazuJobInvocationRequest(
family = "mailer",
jobName = "recommendationsEmail",
parameters = emailParams))
Deploying Tasks
Easy Deployment
1. Merge into master. Done!
Jenkins Build Steps:
1. Builds zip package from master
2. Prepares Docker image
3. Pushes docker image into docker registry
4. Registers updated tasks with ECS APIs
Logs
● Logs are in /var/lib/docker/containers/*
● Upload into log analysis service (Sumologic for us)
● Wrapper prints out task name and task ID
at the start for easy searching
● Good enough for now
Metrics
● Using third-party metrics collector (Datadog)
● Metrics for both tasks and container instances
● So long as can talk to Internet,
things will work out pretty well
Programming
Assignments
Programming Assignments
Grading Prog. Assignments
● Special case of batch processing
● Near real-time feedback
o <30 seconds for fast graders
● Compiling and running untrusted code
● Infrastructure security huge concern
o Minimize exfiltration of info (e.g. test cases)
o Avoid turning into bitcoin miners or DDoS attack
GRading Inside Docker: Architecture
Alternate (GRID) AWS AccountProduction AWS Account
Amazon S3
Amazon ECS
Strict network firewall
Worker Worker
Ctrl
Cluster
Ctrl
Cluster
GRID: Defense in depth
Network
Completely separate AWS account
Network ACLs & Routing Tables
Host Docker / Other
+ additional mitigation
techniques and defenses
Custom cleaning of container images
Modifying the ECS Agent
● Coursera has 2 simple forks of ECS agent
o Allow privileged docker-in-docker access for the
“cleaning” agent
o Disable networking and disk writes for untrusted
code in the “grading” agent
● Check it out at:
github.com/coursera/amazon-ecs-agent
Usage
● Iguazú
o 38 tasks written since launch in April
o 24 scheduled tasks
o >1000 invocations per day
● Grid
o Pre-production at this time (launching in weeks)
o Dozens of graders already written by multiple
instructional teams!
Future Improvements
● True Autoscaling
o Scaling up is easy, scaling down not so much
● Task prioritization (multiple queues)
● Simulate memory and CPU limits in dev
modes
Lessons Learned / Docker war stories
● Docker instability fun
o Container format changing between 1.0 and 1.5.
● btrfs wedging on older kernels
o Default 14.04 kernel not new enough!
● Disk usage
o Docker-in-docker can’t clean up after itself (yet).
Thank You!
Questions?
We are hiring! www.coursera.org/jobs

Contenu connexe

Tendances

(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
(CMP407) Lambda as Cron: Scheduling Invocations in AWS LambdaAmazon Web Services
 
AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)
AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)
AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)Amazon Web Services
 
AWS re:Invent 2016: NEW LAUNCH! Lambda Everywhere (IOT309)
AWS re:Invent 2016: NEW LAUNCH! Lambda Everywhere (IOT309)AWS re:Invent 2016: NEW LAUNCH! Lambda Everywhere (IOT309)
AWS re:Invent 2016: NEW LAUNCH! Lambda Everywhere (IOT309)Amazon Web Services
 
Amazon Batch: 實現簡單且有效率的批次運算
Amazon Batch: 實現簡單且有效率的批次運算Amazon Batch: 實現簡單且有效率的批次運算
Amazon Batch: 實現簡單且有效率的批次運算Amazon Web Services
 
SMC304 Serverless Orchestration with AWS Step Functions
SMC304 Serverless Orchestration with AWS Step FunctionsSMC304 Serverless Orchestration with AWS Step Functions
SMC304 Serverless Orchestration with AWS Step FunctionsAmazon Web Services
 
ENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million UsersENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million UsersAmazon Web Services
 
Real-Time Processing Using AWS Lambda
Real-Time Processing Using AWS LambdaReal-Time Processing Using AWS Lambda
Real-Time Processing Using AWS LambdaAmazon Web Services
 
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...Julien SIMON
 
Micrsoservices unleashed with containers and ECS
Micrsoservices unleashed with containers and ECSMicrsoservices unleashed with containers and ECS
Micrsoservices unleashed with containers and ECSAmazon Web Services
 
New AWS Services for Bioinformatics
New AWS Services for BioinformaticsNew AWS Services for Bioinformatics
New AWS Services for BioinformaticsLynn Langit
 
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料Amazon Web Services
 
Auto scaling applications in 10 minutes (CakeFest 2013)
Auto scaling applications in 10 minutes (CakeFest 2013)Auto scaling applications in 10 minutes (CakeFest 2013)
Auto scaling applications in 10 minutes (CakeFest 2013)Juan Basso
 
Running Containerised Applications at Scale on AWS
Running Containerised Applications at Scale on AWSRunning Containerised Applications at Scale on AWS
Running Containerised Applications at Scale on AWSAmazon Web Services
 
AWS CloudFormation Best Practices
AWS CloudFormation Best PracticesAWS CloudFormation Best Practices
AWS CloudFormation Best PracticesAmazon Web Services
 
Batch Processing with Containers on AWS - June 2017 AWS Online Tech Talks
Batch Processing with Containers on AWS -  June 2017 AWS Online Tech TalksBatch Processing with Containers on AWS -  June 2017 AWS Online Tech Talks
Batch Processing with Containers on AWS - June 2017 AWS Online Tech TalksAmazon Web Services
 
(CMP403) AWS Lambda: Simplifying Big Data Workloads
(CMP403) AWS Lambda: Simplifying Big Data Workloads(CMP403) AWS Lambda: Simplifying Big Data Workloads
(CMP403) AWS Lambda: Simplifying Big Data WorkloadsAmazon Web Services
 
AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側SmartNews, Inc.
 
SMC303 Real-time Data Processing Using AWS Lambda
SMC303 Real-time Data Processing Using AWS LambdaSMC303 Real-time Data Processing Using AWS Lambda
SMC303 Real-time Data Processing Using AWS LambdaAmazon Web Services
 
SmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews, Inc.
 
Getting Started with Docker on AWS
Getting Started with Docker on AWSGetting Started with Docker on AWS
Getting Started with Docker on AWSAmazon Web Services
 

Tendances (20)

(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
(CMP407) Lambda as Cron: Scheduling Invocations in AWS Lambda
 
AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)
AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)
AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)
 
AWS re:Invent 2016: NEW LAUNCH! Lambda Everywhere (IOT309)
AWS re:Invent 2016: NEW LAUNCH! Lambda Everywhere (IOT309)AWS re:Invent 2016: NEW LAUNCH! Lambda Everywhere (IOT309)
AWS re:Invent 2016: NEW LAUNCH! Lambda Everywhere (IOT309)
 
Amazon Batch: 實現簡單且有效率的批次運算
Amazon Batch: 實現簡單且有效率的批次運算Amazon Batch: 實現簡單且有效率的批次運算
Amazon Batch: 實現簡單且有效率的批次運算
 
SMC304 Serverless Orchestration with AWS Step Functions
SMC304 Serverless Orchestration with AWS Step FunctionsSMC304 Serverless Orchestration with AWS Step Functions
SMC304 Serverless Orchestration with AWS Step Functions
 
ENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million UsersENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million Users
 
Real-Time Processing Using AWS Lambda
Real-Time Processing Using AWS LambdaReal-Time Processing Using AWS Lambda
Real-Time Processing Using AWS Lambda
 
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...
 
Micrsoservices unleashed with containers and ECS
Micrsoservices unleashed with containers and ECSMicrsoservices unleashed with containers and ECS
Micrsoservices unleashed with containers and ECS
 
New AWS Services for Bioinformatics
New AWS Services for BioinformaticsNew AWS Services for Bioinformatics
New AWS Services for Bioinformatics
 
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
 
Auto scaling applications in 10 minutes (CakeFest 2013)
Auto scaling applications in 10 minutes (CakeFest 2013)Auto scaling applications in 10 minutes (CakeFest 2013)
Auto scaling applications in 10 minutes (CakeFest 2013)
 
Running Containerised Applications at Scale on AWS
Running Containerised Applications at Scale on AWSRunning Containerised Applications at Scale on AWS
Running Containerised Applications at Scale on AWS
 
AWS CloudFormation Best Practices
AWS CloudFormation Best PracticesAWS CloudFormation Best Practices
AWS CloudFormation Best Practices
 
Batch Processing with Containers on AWS - June 2017 AWS Online Tech Talks
Batch Processing with Containers on AWS -  June 2017 AWS Online Tech TalksBatch Processing with Containers on AWS -  June 2017 AWS Online Tech Talks
Batch Processing with Containers on AWS - June 2017 AWS Online Tech Talks
 
(CMP403) AWS Lambda: Simplifying Big Data Workloads
(CMP403) AWS Lambda: Simplifying Big Data Workloads(CMP403) AWS Lambda: Simplifying Big Data Workloads
(CMP403) AWS Lambda: Simplifying Big Data Workloads
 
AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側
 
SMC303 Real-time Data Processing Using AWS Lambda
SMC303 Real-time Data Processing Using AWS LambdaSMC303 Real-time Data Processing Using AWS Lambda
SMC303 Real-time Data Processing Using AWS Lambda
 
SmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews's journey into microservices
SmartNews's journey into microservices
 
Getting Started with Docker on AWS
Getting Started with Docker on AWSGetting Started with Docker on AWS
Getting Started with Docker on AWS
 

Similaire à Batch Processing with Amazon EC2 Container Service

Docker & ECS: Secure Nearline Execution
Docker & ECS: Secure Nearline ExecutionDocker & ECS: Secure Nearline Execution
Docker & ECS: Secure Nearline ExecutionBrennan Saeta
 
Scala, ECS, Docker: Delayed Execution @Coursera
Scala, ECS, Docker: Delayed Execution @CourseraScala, ECS, Docker: Delayed Execution @Coursera
Scala, ECS, Docker: Delayed Execution @CourseraC4Media
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016aspyker
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Sharma Podila
 
So you want to write a cloud function
So you want to write a cloud functionSo you want to write a cloud function
So you want to write a cloud functionElad Hirsch
 
Devops with Python by Yaniv Cohen DevopShift
Devops with Python by Yaniv Cohen DevopShiftDevops with Python by Yaniv Cohen DevopShift
Devops with Python by Yaniv Cohen DevopShiftYaniv cohen
 
Truemotion Adventures in Containerization
Truemotion Adventures in ContainerizationTruemotion Adventures in Containerization
Truemotion Adventures in ContainerizationRyan Hunter
 
Writing and deploying serverless python applications
Writing and deploying serverless python applicationsWriting and deploying serverless python applications
Writing and deploying serverless python applicationsCesar Cardenas Desales
 
20211202 NADOG Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211202 NADOG Adapting to Covid with Serverless Craeg Strong Ariel Partners20211202 NADOG Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211202 NADOG Adapting to Covid with Serverless Craeg Strong Ariel PartnersCraeg Strong
 
Amazon ECS at Coursera: A unified execution framework while defending against...
Amazon ECS at Coursera: A unified execution framework while defending against...Amazon ECS at Coursera: A unified execution framework while defending against...
Amazon ECS at Coursera: A unified execution framework while defending against...Brennan Saeta
 
(CMP406) Amazon ECS at Coursera: A general-purpose microservice
(CMP406) Amazon ECS at Coursera: A general-purpose microservice(CMP406) Amazon ECS at Coursera: A general-purpose microservice
(CMP406) Amazon ECS at Coursera: A general-purpose microserviceAmazon Web Services
 
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupWhat's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupKaxil Naik
 
Gradle - From minutes to seconds: minimizing build times
Gradle - From minutes to seconds: minimizing build timesGradle - From minutes to seconds: minimizing build times
Gradle - From minutes to seconds: minimizing build timesRene Gröschke
 
20211202 North America DevOps Group NADOG Adapting to Covid With Serverless C...
20211202 North America DevOps Group NADOG Adapting to Covid With Serverless C...20211202 North America DevOps Group NADOG Adapting to Covid With Serverless C...
20211202 North America DevOps Group NADOG Adapting to Covid With Serverless C...Craeg Strong
 
Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17aspyker
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkDemi Ben-Ari
 
6 Months Sailing with Docker in Production
6 Months Sailing with Docker in Production 6 Months Sailing with Docker in Production
6 Months Sailing with Docker in Production Hung Lin
 
Next Generation Automation in Ruckus Wireless
Next Generation Automation in Ruckus WirelessNext Generation Automation in Ruckus Wireless
Next Generation Automation in Ruckus WirelessDavid Ko
 
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflixaspyker
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadKrivoy Rog IT Community
 

Similaire à Batch Processing with Amazon EC2 Container Service (20)

Docker & ECS: Secure Nearline Execution
Docker & ECS: Secure Nearline ExecutionDocker & ECS: Secure Nearline Execution
Docker & ECS: Secure Nearline Execution
 
Scala, ECS, Docker: Delayed Execution @Coursera
Scala, ECS, Docker: Delayed Execution @CourseraScala, ECS, Docker: Delayed Execution @Coursera
Scala, ECS, Docker: Delayed Execution @Coursera
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
 
So you want to write a cloud function
So you want to write a cloud functionSo you want to write a cloud function
So you want to write a cloud function
 
Devops with Python by Yaniv Cohen DevopShift
Devops with Python by Yaniv Cohen DevopShiftDevops with Python by Yaniv Cohen DevopShift
Devops with Python by Yaniv Cohen DevopShift
 
Truemotion Adventures in Containerization
Truemotion Adventures in ContainerizationTruemotion Adventures in Containerization
Truemotion Adventures in Containerization
 
Writing and deploying serverless python applications
Writing and deploying serverless python applicationsWriting and deploying serverless python applications
Writing and deploying serverless python applications
 
20211202 NADOG Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211202 NADOG Adapting to Covid with Serverless Craeg Strong Ariel Partners20211202 NADOG Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211202 NADOG Adapting to Covid with Serverless Craeg Strong Ariel Partners
 
Amazon ECS at Coursera: A unified execution framework while defending against...
Amazon ECS at Coursera: A unified execution framework while defending against...Amazon ECS at Coursera: A unified execution framework while defending against...
Amazon ECS at Coursera: A unified execution framework while defending against...
 
(CMP406) Amazon ECS at Coursera: A general-purpose microservice
(CMP406) Amazon ECS at Coursera: A general-purpose microservice(CMP406) Amazon ECS at Coursera: A general-purpose microservice
(CMP406) Amazon ECS at Coursera: A general-purpose microservice
 
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupWhat's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
 
Gradle - From minutes to seconds: minimizing build times
Gradle - From minutes to seconds: minimizing build timesGradle - From minutes to seconds: minimizing build times
Gradle - From minutes to seconds: minimizing build times
 
20211202 North America DevOps Group NADOG Adapting to Covid With Serverless C...
20211202 North America DevOps Group NADOG Adapting to Covid With Serverless C...20211202 North America DevOps Group NADOG Adapting to Covid With Serverless C...
20211202 North America DevOps Group NADOG Adapting to Covid With Serverless C...
 
Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
 
6 Months Sailing with Docker in Production
6 Months Sailing with Docker in Production 6 Months Sailing with Docker in Production
6 Months Sailing with Docker in Production
 
Next Generation Automation in Ruckus Wireless
Next Generation Automation in Ruckus WirelessNext Generation Automation in Ruckus Wireless
Next Generation Automation in Ruckus Wireless
 
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflix
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
 

Plus de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Batch Processing with Amazon EC2 Container Service

  • 1. Batch Processing with Amazon EC2 Container Service Brennan Saeta, Frank Chen Infrastructure Engineering @ Coursera
  • 2.
  • 3. What We Do Massive Open Online Courses 14 million learners 120 partners 1000 courses 2.2 million course completions
  • 5. Batch Processing Enables... Reporting Instructor Reports ● Grade exports ● Learner demographics ● Course progress statistics Internal Reports ● Business metrics ● Payments reconciliation
  • 6. Batch Processing Enables... Marketing ● Recommendation emails ● Batch marketing / reactivation emails
  • 7. Batch Processing Enables... Pedagogical Innovation ● Auto-graded programming assignments ● Peer-review matching & analysis
  • 8. Bad Old Days of Batch Processing Cascade ● PHP-based job runner ● Run in screen sessions ● Polled for new jobs ● Fragile and unreliable ● Forced restarts on regular basis
  • 9. Bad Old Days of Batch Processing II Saturn ● Scala-based scheduled batch process runner o Powered by Quartz Scheduler library ● Cron-only jobs o Cannot run jobs on demand o Some jobs have to wake up frequently to poll ● All jobs ran on same instance (and JVM), causing interference
  • 10. What Do We Want? Reliability ● Saturn / Cascade were flaky ● Developers became frustrated with jobs not running properly flickr.com/rclarkeimages, cc by-nc 2.0
  • 11. What Do We Want? Easy Development ● Developing & testing locally was difficult ● Little or no boilerplate should be required flickr.com/riebart, cc by 2.0
  • 12. What Do We Want? Easy Deployment ● Deployment was difficult and non-deterministic ● “Other services have one-click tools, why can’t your service have that too?” flickr.com/derelllicht, cc by-nd 2.0
  • 13. What Do We Want? High Efficiency ● Cost-conscious ● Most jobs complete < 20 minutes o EC2 rounds costs up to full hour ● Startup time of individual instances too long for batch processingflickr.com/koertmichiels, cc by-nc-nd 2.0
  • 14. What Do We Want? Low Ops Load ● Only one dev-ops engineer -- can’t manage everything ● Developers own their services ● Developers shouldn’t have to actively monitor services flickr.com/reynermedia, cc by 2.0
  • 15. Alternative Technologies Home-grown Tech ● Tried, but proved to be unreliable ● Difficult to handle coordination and synchronization ● Very powerful, but hard to productionize ● Needs actual DevOps team ● GCE first-class, everything else second-class
  • 16. Amazon ECS ● Low-to-no maintenance solution ● Integrated with AWS infrastructure ● Easy to understand and program for
  • 17. However… ● No scheduled tasks ● No fine-grained monitoring of tasks ● No retries / delays when cluster out of resources ● Does not integrate well with our existing Scala APIs and tooling
  • 18. Iguazú -- Batch Management for ECS ● Named for Iguazú Falls o World’s largest waterfall ● Batch Task Scheduler o Immediately o Deferred (run once at X time) o Scheduled recurring (cron-like) ● Programmatically accessible via standard APIs / clients flickr.com/mrpunto, cc by-nc-nd 2.0
  • 19. Iguazú Semantic Guarantees ● At most once execution for all jobs ● Jobs will be provided with at least the CPU and RAM they requested ● Scheduler may elect to skip execution of some scheduled jobs under adverse conditions
  • 20. Iguazú Architecture Service s Services Iguazú Admin Iguazú frontend Iguazú backend Amazon SQS Cassandra Worker Machines Worker Worker scheduler developers learners ECS APIs
  • 21. Iguazú Design ● Frontend + Scheduler o Generates requests (either via API calls or internally from the scheduler) o Puts new requests in SQS queues o Handles requests for status from other services ● Backend o Attempts to run tasks via ECS API  Failure (e.g. lack of resources) means task goes back into queue to try again later o Keeps track of task status and updates Cassandra
  • 22. Iguazú Admin User Interface
  • 23. Iguazú Admin User Interface
  • 24. Developing Iguazú Tasks class Job extends AbstractJob with StrictLogging { override val reservedCpu = 1024 override val reservedMemory = 1024 def run(parameters: JsValue) = { logger.info(“I am running my Job!”) expensiveComputationHere() } }
  • 25. Running Tasks Locally $ sbt > project iguazuJobs [info] Set current project to iguazu-jobs > run sample sample.json [info] Running org.coursera.iguazu.internal.IguazuJobsMain sample sample.json [info] 2015-07-08 13:31:52,368 INFO [o.c.i.j.s.Job] >>> I am running my job! [success] Total time: 5 s, completed Jul 8, 2015 1:31:58 PM >
  • 26. Running Tasks from Other Services // invoking a job with one command // from another service via Naptime REST framework val invocationId = IguazuJobInvocationClient .create(IguazuJobInvocationRequest( family = "mailer", jobName = "recommendationsEmail", parameters = emailParams))
  • 27. Deploying Tasks Easy Deployment 1. Merge into master. Done! Jenkins Build Steps: 1. Builds zip package from master 2. Prepares Docker image 3. Pushes docker image into docker registry 4. Registers updated tasks with ECS APIs
  • 28. Logs ● Logs are in /var/lib/docker/containers/* ● Upload into log analysis service (Sumologic for us) ● Wrapper prints out task name and task ID at the start for easy searching ● Good enough for now
  • 29. Metrics ● Using third-party metrics collector (Datadog) ● Metrics for both tasks and container instances ● So long as can talk to Internet, things will work out pretty well
  • 32.
  • 33.
  • 34. Grading Prog. Assignments ● Special case of batch processing ● Near real-time feedback o <30 seconds for fast graders ● Compiling and running untrusted code ● Infrastructure security huge concern o Minimize exfiltration of info (e.g. test cases) o Avoid turning into bitcoin miners or DDoS attack
  • 35. GRading Inside Docker: Architecture Alternate (GRID) AWS AccountProduction AWS Account Amazon S3 Amazon ECS Strict network firewall Worker Worker Ctrl Cluster Ctrl Cluster
  • 36. GRID: Defense in depth Network Completely separate AWS account Network ACLs & Routing Tables Host Docker / Other + additional mitigation techniques and defenses Custom cleaning of container images
  • 37. Modifying the ECS Agent ● Coursera has 2 simple forks of ECS agent o Allow privileged docker-in-docker access for the “cleaning” agent o Disable networking and disk writes for untrusted code in the “grading” agent ● Check it out at: github.com/coursera/amazon-ecs-agent
  • 38.
  • 39. Usage ● Iguazú o 38 tasks written since launch in April o 24 scheduled tasks o >1000 invocations per day ● Grid o Pre-production at this time (launching in weeks) o Dozens of graders already written by multiple instructional teams!
  • 40. Future Improvements ● True Autoscaling o Scaling up is easy, scaling down not so much ● Task prioritization (multiple queues) ● Simulate memory and CPU limits in dev modes
  • 41. Lessons Learned / Docker war stories ● Docker instability fun o Container format changing between 1.0 and 1.5. ● btrfs wedging on older kernels o Default 14.04 kernel not new enough! ● Disk usage o Docker-in-docker can’t clean up after itself (yet).
  • 42. Thank You! Questions? We are hiring! www.coursera.org/jobs

Notes de l'éditeur

  1. Service-based architecture (dozens of services) Scala-based backends, with legacy PHP and Python Cassandra, MySQL and S3 as data storage layer Completely within the AWS Cloud Utilize a lot of AWS services Tooling developed specifically for AWS ops
  2. Our programming assignment infrastructure must be very flexible, powerful and general purpose. We have courses that cover a huge range of topics. Some courses are an introduction to computing with Python. Others are complete tutorials into functional reactive programming in Scala. Other courses use Javascript for learning, or Matlab for Machine Learning. We have a course on full stack web development based upon Ruby on Rails. We even have advanced courses in parallel programming that require the use of the GPU!
  3. This is an example programming assignment for a new course we are about to launch.
  4. This is a view of the web upload. https://www.coursera.org/learn/spcpp-2/programming/ZZOH8/bian-cheng-zuo-ye
  5. Huge thank you’s also to: Bryan Kane Colleen Lee Nick Dellamaggiore Kam Syed (Amazon) KD Singh (Amazon)