Autopilot: workload autoscaling at Google
Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski, Przemysław Zych, Przemek Broniek, Jarek Kuśmierek, Paweł Nowak, Beata Strack, Piotr Witusowski, Steven Hand, John Wilkes (Google)
EuroSys 2020
April 2020
Proprietary + Confidential
Google runs in containers
In any given week, we launch over two billion containers across Google.
Resource limits are crucial to isolate workloads
container limit: max amount of CPU/mem a container can use
container usage: CPU/mem used
container slack: CPU/mem wasted
Borg, our scheduler, packs containers to machines by resource limits.
image source: http://dx.doi.org/10.1145/2741948.2741964 [Verma et al., EuroSys’15]
Limits are fine-grained:
CPU in milli-cores
memory in bytes
Source: http://dx.doi.org/10.1145/2741948.2741964 [Verma et al., EuroSys’15]
We pack containers to machines by limits.
So, precise limits are crucial for efficiency and reliability.
[Figure: containers A, B and C packed into a machine's capacity by their limits. Three cases are annotated. Precise limits (limit close to usage): good! Requested usage above the container limit: out-of-resources crash, bad! Limit far above usage: resources are wasted (underallocated machine), bad!]
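The three cases in the packing picture can be sketched in a few lines of Python. This is only an illustration, not Borg's scheduling logic: the `Container` class, the `fits` and `classify` helpers, the 10% `tolerance` threshold, and all numbers are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Container:
    name: str
    limit: float  # resources reserved for the container
    usage: float  # resources the container actually tries to use


def fits(containers, capacity):
    """Packing is by limits: the sum of limits must not exceed machine capacity."""
    return sum(c.limit for c in containers) <= capacity


def classify(c, tolerance=0.1):
    """Label a limit as 'crash', 'wasted' or 'precise'.

    `tolerance` (fraction of the limit that may stay unused) is a
    hypothetical threshold chosen only for this sketch.
    """
    if c.usage > c.limit:
        return "crash"      # usage exceeds the limit: out-of-resources
    if c.limit - c.usage > tolerance * c.limit:
        return "wasted"     # large slack: the machine is underallocated
    return "precise"


machine_capacity = 10.0
containers = [
    Container("A", limit=4.0, usage=3.8),
    Container("B", limit=3.0, usage=3.5),
    Container("C", limit=3.0, usage=1.0),
]
print(fits(containers, machine_capacity))
print({c.name: classify(c) for c in containers})
```

The scheduler only sees limits, so container C blocks 3.0 units of capacity although it uses 1.0, while container B fits on the machine but would run out of resources.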
Autopilot acts as a controller for Borg limits.
Autopilot continuously adjusts resource limits:
CPU/Mem limits for containers (vertical scaling),
number of replicas (horizontal scaling).
[Figure: Autopilot control loop. Arrow labels: container limits, container counts; container limits, start/stop containers.]
Autopilot Recommenders
Moving window recommenders
● Exponentially-decaying samples (half-life of 48 hours)
● Compute statistics over the samples, e.g. 95%ile
● Add a safety margin
[Figure: resources vs. time, showing usage, the exponential decay of sample weights, the 98%ile of the samples, the safety margin, and the resulting limit.]
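A minimal sketch of the three steps above (decay the samples, take a percentile, add a margin). The function names, the sample format, the weighted-percentile rule and the 10% margin are assumptions for illustration; only the 48-hour half-life and the 95%ile example come from the slide.

```python
def decayed_percentile(samples, now_hours, percentile=95.0, half_life_hours=48.0):
    """Weighted percentile in which a sample's weight halves every half_life_hours.

    samples: list of (timestamp_hours, usage) pairs.
    """
    if not samples:
        raise ValueError("need at least one usage sample")
    # Sort by usage; each sample carries its exponentially decayed weight.
    weighted = sorted(
        (usage, 0.5 ** ((now_hours - t) / half_life_hours)) for t, usage in samples
    )
    threshold = sum(w for _, w in weighted) * percentile / 100.0
    cumulative = 0.0
    for usage, w in weighted:
        cumulative += w
        if cumulative >= threshold:
            return usage
    return weighted[-1][0]  # guards against floating-point rounding


def recommend_limit(samples, now_hours, safety_margin=0.10):
    """Limit = decayed 95th percentile of usage, inflated by a safety margin."""
    return decayed_percentile(samples, now_hours) * (1.0 + safety_margin)


# An old peak (96 hours ago, weight 0.25) counts less than a fresh sample.
history = [(0.0, 10.0), (96.0, 1.0)]
print(decayed_percentile(history, now_hours=96.0, percentile=50.0))
print(recommend_limit(history, now_hours=96.0))
```

With this weighting a peak still influences high percentiles for a while, but its pull fades as newer samples accumulate.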
Machine learning recommenders
● Each model is an arg-max algorithm picking a limit value
● Each model is parametrized by the decay rate and the safety margin.
● The recommender picks the model performing the best over a longer time period.
[Figure: models 1, 2, ..., n, each with its own decay rate, each proposing a limit.]
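The ensemble idea can be sketched as follows. This is a simplified illustration, not Autopilot's actual cost function: `step_cost`, `overrun_weight`, the candidate limits, the replay window and every number are assumptions. Each model picks the candidate limit with the highest score (here, the negated decayed cost) and adds its safety margin; the recommender then keeps the (decay rate, safety margin) pair that would have done best when replayed over a longer history.

```python
from itertools import product


def step_cost(usage, limit, overrun_weight=10.0):
    """Cost of one sample: overruns are penalized more than unused headroom (assumed shape)."""
    if usage > limit:
        return overrun_weight * (usage - limit)
    return limit - usage


def decayed_cost(history, limit, decay):
    """Exponentially decayed cost of `limit` over `history` (oldest sample first)."""
    cost = 0.0
    for usage in history:
        cost = decay * cost + step_cost(usage, limit)
    return cost


def model_recommendation(history, candidates, decay, margin):
    """One model: arg-max of the score (negated cost) over candidates, plus the margin."""
    best = max(candidates, key=lambda limit: -decayed_cost(history, limit, decay))
    return best * (1.0 + margin)


def pick_model(long_history, window, candidates, decays, margins):
    """Replay every (decay, margin) model over a longer period and keep the cheapest."""
    def replay_cost(params):
        decay, margin = params
        total = 0.0
        for t in range(window, len(long_history)):
            limit = model_recommendation(long_history[t - window:t], candidates, decay, margin)
            total += step_cost(long_history[t], limit)
        return total

    return min(product(decays, margins), key=replay_cost)


usage_history = [1.0] * 20
print(pick_model(usage_history, window=5, candidates=[1.0, 2.0, 3.0],
                 decays=[0.5, 0.9], margins=[0.0, 0.5]))
```

For a perfectly flat workload the replay favors the model with no safety margin; a spikier history would shift the choice toward larger margins or slower decay.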
Evaluation:
Observational study of production jobs
Focus on memory
Autopilot efficiency - reduction of slack
relative slack: (av_limit - 95%ile usage) / av_limit
  av_limit: average limit during the day
  95%ile usage: 95%ile of usage during the day
absolute slack: ∫ slack(t) dt = ∫ limit(t) dt - ∫ usage(t) dt
  unit: capacity of a single (largish) machine
[Figure: limit(t), usage(t) and slack(t) over time.]
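Both slack metrics can be computed from sampled limit and usage series, as in the sketch below. The nearest-rank percentile, the fixed sampling interval `dt` and the sample values are assumptions for illustration; dividing the absolute slack by a machine's capacity would express it in machines.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def relative_slack(limits, usages):
    """(av_limit - 95%ile usage) / av_limit over one day of samples."""
    av_limit = sum(limits) / len(limits)
    return (av_limit - percentile(usages, 95)) / av_limit


def absolute_slack(limits, usages, dt=1.0):
    """Rectangle-rule approximation of the integral of limit(t) - usage(t)."""
    return sum((limit - usage) * dt for limit, usage in zip(limits, usages))


limits = [10.0, 10.0, 10.0, 10.0]
usages = [2.0, 4.0, 6.0, 8.0]
print(relative_slack(limits, usages))
print(absolute_slack(limits, usages))
```

Relative slack is dimensionless and compares jobs of different sizes; absolute slack grows with job size and shows where capacity is actually lost.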
Autopiloted jobs have significantly smaller relative slack.
A random sample of 5000 jobs in each category.
Relative slack: (av_limit - 95%ile usage) / av_limit
[Figure: cumulative distribution function of relative slack, annotated from "better" to "worse".]
Autopiloted jobs save significant capacity.
A random sample of 5000 jobs in each category.
[Figure: cumulative distribution function of absolute slack [machines], annotated from "better" to "worse".]
When jobs migrate to Autopilot, their slack is significantly reduced.
A random sample of 500 jobs that migrated to Autopilot in a certain month, m0.
CDFs for slack for 2 months before and after migration.
[Figure: cumulative distribution function of the reduction of relative slack, annotated from "better" to "worse".]
Autopilot reliability: how frequent are out-of-memory errors?
We count terminations of containers.
We weight the number of terminations by the average number of containers of a job.
[Figure: a container whose requested usage exceeds the container limit, within the machine capacity: out-of-resources crash.]
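One way to read the weighting above is as a per-container rate, sketched below. Interpreting "weight" as dividing by the average number of containers is an assumption, as are the function names and the sample jobs.

```python
def relative_ooms(oom_terminations, avg_containers):
    """OOM terminations of a job, normalized by its average number of containers.

    Dividing (rather than multiplying) is an assumed reading of the slide.
    """
    if avg_containers <= 0:
        raise ValueError("avg_containers must be positive")
    return oom_terminations / avg_containers


def fraction_without_ooms(jobs):
    """Share of jobs with zero OOM terminations; jobs is a list of (ooms, avg_containers)."""
    return sum(1 for ooms, _ in jobs if ooms == 0) / len(jobs)


# Hypothetical jobs: (OOM terminations, average number of containers).
jobs = [(0, 100.0), (0, 8.0), (5, 1000.0), (2, 4.0)]
print([relative_ooms(ooms, n) for ooms, n in jobs])
print(fraction_without_ooms(jobs))
```

Under this reading, 5 terminations in a 1000-container job (0.005) count as far less severe than 2 terminations in a 4-container job (0.5).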
Autopilot reduces the frequency of out-of-memory events.
OOMs are rare: 99.5% of autopiloted jobs have no OOMs.
[Figure: cumulative distribution function, annotated from "better" to "worse".]
DevOps:
Autopiloted jobs account for over 48% of Google’s fleet-wide resource usage.
Autopilot’s dynamic limits could help to keep the job running despite bugs.
1. Efficient scheduling requires fine-grained control of jobs’ limits.
2. Humans are bad at setting the limits precisely.
3. Autopilot uses past usage to drive future limits.
4. Autopilot reduces relative slack by 2x
...and it reduces the number of jobs severely impacted by OOMs by 10x.
