Capacity While Cash 
Kim Moir, Mozilla @kmoir 
URES, Seattle, Nov 10, 2014 
Good afternoon. My name is Kim Moir and I’m a release engineer at Mozilla. Today I’m going to discuss how Mozilla scaled our infrastructure on AWS to 
handle the increasing load on our continuous integration farm, while reducing our monthly bills at the same time. 
—- 
References 
Montreal Subway picture 
https://www.flickr.com/photos/dephineprieur/3841791164/sizes/o/
Mozilla is a non-profit. Our mission is to promote openness, innovation & opportunity on the Web. 
! 
You’re probably familiar with the products we build, such as Firefox for Desktop and Android, and Firefox OS. Firefox OS is a relatively new product that Mozilla started working on a few years ago. It’s an open source operating system for smartphones. When a new product was coming online, we knew that we would have to scale our build farm to handle the additional load. 
! 
Note that we ship Firefox on four platforms and with ~97 locales on the same day as US English
Our release cadence is every six weeks for Firefox for Desktop and Android. We release betas every week. 
FirefoxOS is on a different cadence. 
https://wiki.mozilla.org/RapidRelease
Release Engineering is a very geographically distributed team and many of us work remotely. Even those people who work close to a physical Mozilla office work several days a week from home.
Before I talk about how we scaled our build and test infrastructure on AWS, I’m going to talk a bit about the scale of our operations. How many builds we 
run, how many tests, the number of platforms, number of repositories etc. 
Image: 
https://www.flickr.com/photos/30649191@N00/9002545206/sizes/l
Daily 
! 
4500 build jobs 
70,000 test jobs 
Each time a developer lands a change, it invokes a series of builds and associated tests on relevant platforms. Within each test job there are many actual 
test suites that run. 
!!!
We have a commitment to developers that build/test jobs should start within 15 minutes of being requested. We don’t have a perfect record on this, but 
certainly our numbers are good. We have metrics that measure this every day so we can see what platforms need additional capacity. And we adjust 
capacity as needed, and remove old platforms as they become less relevant in the marketplace. 
! 
——— 
Pizza picture 
https://www.flickr.com/photos/djwtwo/9864611814/sizes/l/
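To make the wait-time metric concrete, here is a rough sketch of how such a per-platform report could be computed. This is not our actual reporting code; the job records and platform names below are made up, and only the 15-minute threshold comes from the talk.

```python
from datetime import datetime, timedelta

# Hypothetical job records: (platform, time requested, time started).
jobs = [
    ("linux64", datetime(2014, 11, 3, 9, 0), datetime(2014, 11, 3, 9, 4)),
    ("win32",   datetime(2014, 11, 3, 9, 0), datetime(2014, 11, 3, 9, 22)),
    ("android", datetime(2014, 11, 3, 9, 5), datetime(2014, 11, 3, 9, 11)),
]

THRESHOLD = timedelta(minutes=15)

def wait_time_report(jobs):
    """Per platform, what fraction of jobs started within 15 minutes."""
    totals, on_time = {}, {}
    for platform, requested, started in jobs:
        totals[platform] = totals.get(platform, 0) + 1
        if started - requested <= THRESHOLD:
            on_time[platform] = on_time.get(platform, 0) + 1
    return {p: 100.0 * on_time.get(p, 0) / totals[p] for p in totals}

for platform, pct in sorted(wait_time_report(jobs).items()):
    print("%-10s %5.1f%% of jobs started within 15 minutes" % (platform, pct))
```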
Platforms 
• Windows 
• Mac 
• Linux 
• Android 
• all of the above × many OS versions 
We build and test on these platforms, and on many different releases of each of them.
You can see all the versions of the platforms we build for on this page, http://treeherder.mozilla.org. It’s a web page anyone can view. Our developers look at it to see the results of their builds and tests, organized by branch.
Devices 
• 5600+ in total 
• 1600+ for builds 
• 4000+ for tests 
We have a lot of hardware in our build farm, both physically in our two data centres and virtually (AWS) 
! 
——- 
References 
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html 
* https://secure.pub.build.mozilla.org/slavealloc/ui/#silos
Most companies that do a lot of mobile device testing just have a roomful of devices that developers can test on. 
! 
We actually run continuous integration tests on Android reference cards. We have about 800 of them. They are called pandas and are rack mounted. 
These devices are not as stable as desktop machines, and are prone to failure. Given their numbers, dealing with the constant failures would be very expensive if the machines were managed by humans. 
! 
As an aside, the failure rate on these reference devices is much higher (18%) than running the tests on emulators in AWS (2%) 
! 
___ 
References 
Pictures of Panda chassis from Dustin’s blog 
https://blog.mozilla.org/it/2013/01/04/mozpool/2012-11-09-08-30-03/
Bursty traffic - you can see that the number of jobs run each day is variable as time zones wake up. The large trough is obviously the weekend. 
! 
source: http://atlee.ca/blog/posts/bursty-load.html
We also have a lot of repositories to manage
We have many different branches in Hg at Mozilla. Our Hg branches are all named after different tree species 
Developers push to different branches depending on their purpose. Different branches have different scheduling priorities within our continuous integration engine. So for instance, if a change lands on the mozilla-beta branch, the builds and tests associated with that change will have machines allocated to them at a higher priority than a change landed on the cedar branch, which is just for testing purposes.
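To illustrate the idea, branch-based prioritization might look roughly like the toy sketch below. This is not the actual scheduling code in buildbotcustom; the numeric priorities are invented, and only the branch names come from the talk.

```python
# Illustrative only: a toy version of branch-based prioritization, not the
# actual scheduling code in buildbotcustom. Branch names are from the talk;
# the numeric priorities are made up (lower number = served first).
BRANCH_PRIORITY = {
    "mozilla-beta":    1,   # release branches get machines first
    "mozilla-central": 3,
    "try":             4,
    "cedar":           5,   # test-only project branch, lowest priority
}
DEFAULT_PRIORITY = 4

def sort_pending(requests):
    """Order pending build requests by branch priority, then by age."""
    return sorted(
        requests,
        key=lambda req: (BRANCH_PRIORITY.get(req["branch"], DEFAULT_PRIORITY),
                         req["submitted_at"]),
    )

pending = [
    {"branch": "cedar",        "submitted_at": 100},
    {"branch": "mozilla-beta", "submitted_at": 105},
    {"branch": "try",          "submitted_at": 90},
]
for req in sort_pending(pending):
    print(req["branch"], req["submitted_at"])
```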
+ many Mozilla tools 
Here are some of the projects that we use in our infrastructure. 
! 
Buildbot is our continuous integration engine. It’s an open source project written in Python. We spend a lot of time writing Python to extend and 
customize it. 
! 
We use Puppet for configuration management of all our Buildbot masters and the Linux and Mac slaves. So when we provision new hardware, we just boot the device and it puppetizes itself based on its role, which is defined by its hostname. 
! 
Our repository of record is hg.mozilla.org, but developers also commit to git repos and these commits are transferred to the hg repository. We also use a lot of Mozilla tools that allow us to scale. These tools are open source as well, and I have links to these repos at the end of the talk. 
! 
—— 
References 
octokitty http://www.flickr.com/photos/tachikoma/2760470578/sizes/l/
This is a picture of how the different parts of our build farm work together. Developers land changes on code repositories such as hg.mozilla.org. 
! 
As I mentioned before, we use an open source continuous integration engine called Buildbot. We have over 50 buildbot masters. Masters are 
segregated by function to run tests, builds, scheduling, and try. Test and build masters are further divided by function so we can limit the type of jobs they 
run and the types of slaves they serve. For instance, a master may have Windows build slaves allocated to it. Or Android test slaves. This makes the 
masters more efficient because you don’t need to have every type of job loaded and consuming memory. It also makes maintenance more efficient in that you can bring down, for example, the Android test masters for maintenance without having to touch other platforms. 
! 
Buildbot polls the hg push log for each of the code repositories. (Hgpoller) 
! 
When the poller detects a change, the information about the change is written into the scheduler database. The buildbot scheduler masters are 
responsible for taking this request in the database and creating a new build request. The build request then will appear as pending in the web page in the 
previous slide. 
! 
The jobs may be on existing hardware in our data centre, or new VMs may start or be created in the cloud to run these pending jobs.
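For readers unfamiliar with Buildbot, here is a minimal, generic Buildbot 0.8-style master.cfg fragment showing the poll-then-schedule flow described above. It is not our production configuration (that lives in build-buildbot-configs and buildbotcustom); the repo URL, builder names, and poll interval are placeholders.

```python
# A minimal, generic Buildbot 0.8-style master.cfg fragment showing the
# poll -> schedule flow described above. This is NOT Mozilla's production
# configuration (that lives in build-buildbot-configs / buildbotcustom);
# the repo URL, builder names, and intervals here are placeholders.
from buildbot.changes.hgpoller import HgPoller
from buildbot.changes.filter import ChangeFilter
from buildbot.schedulers.basic import SingleBranchScheduler

c = BuildmasterConfig = {}

# Poll the hg pushlog for new changes.
c['change_source'] = [
    HgPoller(repourl='https://hg.mozilla.org/mozilla-central',
             branch='default',
             workdir='hgpoller-mozilla-central',
             pollInterval=60),
]

# When a change lands, create build requests for the relevant builders.
c['schedulers'] = [
    SingleBranchScheduler(name='mozilla-central-linux64',
                          change_filter=ChangeFilter(branch='default'),
                          treeStableTimer=None,
                          builderNames=['linux64-build', 'linux64-tests']),
]
```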
This is a street in Bangkok. As you can see, lots of traffic, not much movement. There used to be a problem at Mozilla where some platforms didn’t have 
good wait times because we simply didn’t have the slave capacity to handle them. Many pending tests. This was a source of frustration for developers. 
! 
We used to run all our builds and tests on in-house hardware in our data centres. This was inefficient in that it took a long time to acquire, rack and install 
the machines and burn them in. Also, we could not dynamically bring machines up to deal with peak load, and then put them offline when they were no 
longer needed. 
!
So in early 2012 we started investigating how we could better scale this traffic. 
! 
We investigated running jobs on AWS, starting with CentOS machines. One of the things that allowed us to move to AWS more easily was that we use Puppet to manage the configuration of many of our build and test slaves (the exceptions are Windows and Android). Our puppet modules are role based, so the modifications required to add Amazon VMs were not that difficult. 
! 
This move to AWS provided additional capacity, and some of the machines in our data centres were repurposed to pools that were lacking capacity. 
—- 
Reference 
Cloud picture http://www.flickr.com/photos/paul-vallejo/2359829594/sizes/l/
AWS Terminology 
• EC2 - Elastic Compute Cloud - machines as VMs 
• EBS - Elastic Block Store - network attached storage 
• Region - a separate geographical area 
• Availability zone - multiple, isolated locations within a region 
I’m going to talk a bit about some AWS terms for those of you that may not be familiar with them. 
! 
Notes: 
AWS instance types http://aws.amazon.com/ec2/instance-types/
More AWS terms 
• AMI - Amazon machine image 
• instance type - a VM with defined specifications and cost per hour 
For example: 
-AMIs - Amazon has standard ones that you can modify, or you can create your own 
-pricing on instance types can depend on the region 
-m3.medium currently costs around $0.07/hr in most regions 
-Some instance types may not be available in all availability zones
Source: http://oduinn.com/blog/2012/11/27/releng-production-systems-now-in-3-aws-regions/ 
We have most of our servers in us-east-1 and us-west-2. us-west-1 doesn’t have much in it right now; it would be used as a hot backup if one of the other regions went down. 
Also some traffic is routed over the internet now (ftp via ssl) 
! 
2 in-house data centres 
3 AWS regions 
VPC (private cloud for us within Amazon) 
VPN link between our data centres and Amazon 
! 
Other notes: 
-using the internet for VCS traffic is also part of the story: the IPSEC tunnel is a limited and expensive resource, so moving traffic that has “built in” security and integrity checking out of the tunnel frees up capacity 
! 
60% of our capacity is in AWS. This number does not reflect the amount of traffic, just the number of available devices.
We migrated 
• Linux build and subset of test slaves 
• Builds for Android and tests on Android Emulators 
• Buildbot and Puppet masters to support these slaves 
• vcssync servers 
Buildbot has stateful connections and having connections to slaves in another DC did not work well 
So we created buildbot and puppet masters in AWS to support the slaves we instantiate there. We also have vcssync servers in AWS which support a 
service that maintains bidirectional commits between our hg and git servers.
Not in cloud 
• performance tests 
• graphics tests 
• Builds and tests for 
• Windows 
• Macs 
-Need bare hardware for predictable performance results 
-Graphics tests that need a specific card 
-Might be possible to build on Windows in the future 
-Macs - not available in AWS. Apple licensing prohibits more than two virtual machines on the same Mac. I investigated the possibility of outsourcing them to a “Mac in the cloud” vendor earlier this year, but this is really just “Macs in racks in another data centre”.
Where’s the code? 
• The tools we use are all open source 
• https://github.com/mozilla/build-cloud-tools 
• These use the boto library (a Python interface to AWS): 
https://github.com/boto/boto 
The code we use to interact with AWS APIs resides here
Smarter Bidding Algorithms 
• Important scripts 
• aws_stop_idle.py 
• aws_watch_pending.py 
-stop_idle stops instances that are no longer needed given our current capacity (idle for a certain time period; the threshold depends on whether the instance is on-demand or spot) 
-aws_watch_pending activates instances given the criteria on the next slide
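As a rough sketch of the kind of logic aws_stop_idle.py implements (the real script is in build-cloud-tools; the is_idle() check, the moz-type tag, and the thresholds below are placeholders for illustration):

```python
# A simplified sketch of the kind of logic aws_stop_idle.py implements; the
# real script lives in build-cloud-tools. is_idle() is a placeholder for the
# Buildbot-side check (is the slave connected and running a job?), and the
# "moz-type" tag and idle thresholds are made-up values for illustration.
import boto.ec2

IDLE_THRESHOLD = {
    "on-demand": 10 * 60,  # seconds; on-demand instances bill by the hour, stop them sooner
    "spot":      45 * 60,  # spot instances are cheap, let them linger a little longer
}

def is_idle(instance, idle_for_seconds):
    """Placeholder: ask Buildbot whether this slave has been idle for that long."""
    raise NotImplementedError

def stop_idle_instances(region="us-east-1"):
    conn = boto.ec2.connect_to_region(region)
    reservations = conn.get_all_instances(
        filters={"instance-state-name": "running", "tag:moz-type": "builder"})
    for reservation in reservations:
        for instance in reservation.instances:
            # Spot-backed instances carry a spot_instance_request_id.
            kind = "spot" if instance.spot_instance_request_id else "on-demand"
            if not is_idle(instance, IDLE_THRESHOLD[kind]):
                continue
            if kind == "spot":
                instance.terminate()   # spot instances can't be stopped, only terminated
            else:
                instance.stop()        # stopped on-demand instances can be restarted cheaply
```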
Regions and instances 
• Run instances in multiple regions 
• Start instances in cheaper regions first 
• Automatically shut down inactive instances 
• Start instances that have been recently running 
If you look at aws_watch_pending.py, these are some of the rules that it implements 
! 
We also use machines in multiple AWS regions, both in case one region goes down and to realize cost savings (some regions are cheaper). Currently we only use us-east-1 and us-west-2. Since the rest of our CI infrastructure resides in California, we don’t use most other regions. This is unlike some companies that need to have instances available instantly everywhere - for instance, I recently saw a talk by Bridget Kromhout (http://bridgetkromhout.com/speaking/2014/beyondthecode/), an operations engineer from DramaFever. This company provides international movie content on demand, and they use every single AWS region because their customer base is so distributed. 
! 
Better build times and lower costs if you start instances that have recently been running (still retain artifact dirs, billing advantages) 
!
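A toy sketch of two of those rules, preferring the cheaper region and restarting stopped instances before creating new ones, might look like this. It is not the actual aws_watch_pending.py code; the region ordering, tag name, and helper names are assumptions.

```python
# Illustrative sketch of two of the rules above: prefer the cheaper region,
# and restart stopped instances (which keep their disks and checkouts)
# before creating brand-new ones. Not the actual aws_watch_pending.py code;
# the per-region ordering and tag name are made up.
import boto.ec2

REGIONS_BY_COST = ["us-east-1", "us-west-2"]   # cheapest first (illustrative ordering)

def find_stopped_slaves(conn, slave_type):
    """Stopped instances of the right type, most recently launched first."""
    reservations = conn.get_all_instances(
        filters={"instance-state-name": "stopped", "tag:moz-type": slave_type})
    instances = [i for r in reservations for i in r.instances]
    return sorted(instances, key=lambda i: i.launch_time, reverse=True)

def start_instances_for_pending(slave_type, pending_count):
    started = 0
    for region in REGIONS_BY_COST:
        if started >= pending_count:
            break
        conn = boto.ec2.connect_to_region(region)
        for instance in find_stopped_slaves(conn, slave_type):
            if started >= pending_count:
                break
            print("starting %s in %s for pending %s jobs" % (instance.id, region, slave_type))
            instance.start()
            started += 1
    return started

# e.g. start_instances_for_pending("bld-linux64", pending_count=5)
```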
Use spot instances 
• Use spot instances vs on demand instances 
• much cheaper 
• however not brought up as quickly 
• Useful for tests not builds 
Amazon has many different types of instances. Initially, we used on demand instances. They come up very quickly but cost more per hour than other 
instance types. 
! 
Spot instances are Amazon’s way of auctioning off excess capacity. You bid for an instance, and if nobody else bids a price above your offer, the spot instance will be instantiated for you. However, if you’re running a spot instance and someone bids a higher price than you did, your instance can be killed. But that’s okay, because we have configured Buildbot to retry jobs that failed, and a very small percentage of jobs are killed this way (< 1%). 
! 
Since the spot instances aren’t available as quickly as the on-demand instances, some tests don’t start within 15 minutes but that’s okay. To reduce costs, 
we initially started using spot instances for some of our test slaves. 
! 
Spot instances are instantiated every time with the AMI you specify. So they aren’t really appropriate for builds, because we run incremental builds and 
having the build artifacts on the disk is useful when rerunning the same build type on the machine to reduce wall time. 
! 
Other notes 
Smart bidding spot bidding library https://bugzilla.mozilla.org/show_bug.cgi?id=972562
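A simplified example of bidding for a spot instance with boto follows. The bid heuristic, AMI ID, key pair, and security group names are placeholders; the real, smarter bidding code is tracked in the bug above and lives in build-cloud-tools.

```python
# Simplified example of bidding for a spot instance with boto. The bid
# heuristic, AMI ID, and key/security-group names are placeholders; the
# real, smarter bidding code is linked from the bug above.
import boto.ec2

def request_test_slave(region="us-east-1", instance_type="m3.medium"):
    conn = boto.ec2.connect_to_region(region)

    # Look at recent spot prices and bid a bit above them, but never more
    # than the on-demand price (roughly $0.07/hr for m3.medium in 2014).
    history = conn.get_spot_price_history(instance_type=instance_type,
                                          product_description="Linux/UNIX")
    current = max(h.price for h in history[:10]) if history else 0.01
    bid = min(current * 1.5, 0.07)

    requests = conn.request_spot_instances(
        price=str(bid),
        image_id="ami-00000000",          # placeholder: last night's golden AMI
        count=1,
        instance_type=instance_type,
        key_name="aws-releng",            # placeholder key pair name
        security_groups=["tests"],        # placeholder security group
    )
    print("opened spot request %s at $%.3f/hr" % (requests[0].id, bid))
    return requests

# If the spot instance is later reclaimed mid-job, Buildbot simply retries the job.
```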
Minimum viable instance type 
• Run more tests in parallel on cheaper instance types rather than upgrading the instance type 
• Most tests run on m3.medium but some need more 
• Limit the subset of tests run on more expensive instance types to those that actually need it 
Our tests have a timeout for a suite of tests. If they don’t complete within this timeout, they fail and retry. 
It’s much cheaper to run more tests in parallel on a cheaper instance type than to run them on a more expensive instance type, given the scale of our operations. 
! 
For instance, we have Android tests that run on Emulators on AWS. Some of the reference tests required a c3.xlarge to run. 
The correctness tests were fine to run on m3.medium
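As a toy illustration, the principle is just a default-to-cheap mapping. The suite names below are hypothetical, apart from the talk’s example that Android emulator reftests needed c3.xlarge while correctness tests ran on m3.medium.

```python
# Toy illustration of "minimum viable instance type": default every suite to
# the cheap m3.medium and list only the suites that genuinely need more.
# The mapping itself is hypothetical; the talk's example is that Android
# emulator reftests needed c3.xlarge while correctness tests ran on m3.medium.
DEFAULT_INSTANCE_TYPE = "m3.medium"

NEEDS_BIGGER_INSTANCE = {
    "android-emulator-reftest": "c3.xlarge",
}

def instance_type_for(suite):
    return NEEDS_BIGGER_INSTANCE.get(suite, DEFAULT_INSTANCE_TYPE)

for suite in ("android-emulator-reftest", "mochitest-1", "xpcshell"):
    print(suite, "->", instance_type_for(suite))
```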
Limit EBS use 
• EBS is network-attached storage for the EC2 VM 
• Much cheaper to use the disk that comes with the instance type (see the sketch below)
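Here is a sketch of how an instance can be launched with its ephemeral (instance-store) disk mapped in via boto, so build scratch space lives on the free local disk rather than paid EBS volumes. The AMI ID and device name are placeholders, not our actual configuration.

```python
# A sketch of mapping the instance-store (ephemeral) disk into a new
# instance with boto, so build scratch space lives on the free local
# disk instead of paid EBS volumes. AMI ID and device name are placeholders.
import boto.ec2
from boto.ec2.blockdevicemapping import BlockDeviceMapping, BlockDeviceType

def run_with_instance_store(region="us-east-1"):
    conn = boto.ec2.connect_to_region(region)

    mapping = BlockDeviceMapping()
    mapping["/dev/xvdb"] = BlockDeviceType(ephemeral_name="ephemeral0")

    reservation = conn.run_instances(
        "ami-00000000",                  # placeholder AMI
        instance_type="m3.medium",       # comes with an instance-store disk
        block_device_map=mapping,
    )
    return reservation.instances[0]
```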
So that was good. We had a lot more capacity on our CI farm. But with every change, you encounter some new bottlenecks. 
! 
At Mozilla, when a lot of jobs are failing, we say the trees are burning. 
!!! 
http://atlee.ca/blog/posts/aws-networks-and-burning-trees.html 
! 
—- 
Reference 
http://www.flickr.com/photos/ervins_strauhmanis/9554405492/sizes/l/ 
http://armenzg.blogspot.ca/search?updated-max=2014-02-27T14:07:00-05:00&max-results=3 
!
Bottleneck: Network 
• Firewall for VPN tunnel between Mozilla and AWS couldn’t keep up 
• High latency connecting to scheduler database 
• Jobs weren’t scheduled so unhappy developers! 
All of our traffic from our infrastructure in EC2 was routed over the VPN tunnel to be handled by Mozilla's firewall in our SCL3 data center. And the 
firewall couldn’t keep up. And thus there was a lot of latency connecting to our scheduler database to add jobs.
Solution 
• Created ftp-ssl endpoint 
• gave our AWS instances public addresses (before they only had private ones) 
• Changed our routing tables in AWS to route traffic to ftp-ssl via the public internet rather than our VPN tunnel 
• updated build configs to download files from ftp-ssl vs ftp 
• changed scripts to cache some repos locally vs cloning each time 
• added more capacity to the firewall
Cache all the things 
• Reduce our VPN network utilization further 
• Implement a tool called proxxy 
• Cache build artifacts 
• Cache static tools 
• Note: This increased costs because we increased our reliance on EBS 
AWS region-local caches for https stuff 
https://bugzilla.mozilla.org/show_bug.cgi?id=1017759 
! 
https://wiki.mozilla.org/ReleaseEngineering/Applications/Proxxy 
http://atlee.ca/blog/posts/cache-em-all.html
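As a rough sketch of the client-side idea behind proxxy: try a region-local cache host first and fall back to the origin server. The cache host name below is hypothetical and the requests library is used only for brevity; the real implementation is linked above.

```python
# Rough sketch of the client-side idea behind proxxy: rewrite a download URL
# to point at a region-local cache host and fall back to the origin server if
# the cache is unavailable. The cache host name here is hypothetical and the
# real implementation is linked above; `requests` is used just for brevity.
import requests

# Hypothetical mapping from origin host to a region-local proxxy cache host.
PROXXY_HOSTS = {
    "ftp.mozilla.org": "ftp.mozilla.org.proxxy.example.com",
}

def fetch(url, timeout=30):
    for origin, cache in PROXXY_HOSTS.items():
        if origin in url:
            cached_url = url.replace(origin, cache, 1)
            try:
                response = requests.get(cached_url, timeout=timeout)
                response.raise_for_status()
                return response.content          # served from the regional cache
            except requests.RequestException:
                break                            # cache unreachable: use the origin
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content
```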
Bandwidth 50% 
Source: Chris Atlee http://atlee.ca/blog/posts/cache-em-all.html
- Another issue. Increased wall times. 
- We run incremental builds. This means that if a machine has recently run a certain build type, it will run faster the next time it runs that same type 
because it will just update the existing files on the machine, such as checkouts and object directories. 
- With the switch to AWS, we had large pools of devices allocated to certain types of builds. This means that a build might not run on a machine that has recently run a build of the same type. So a couple of people on the team looked at enabling smaller pools of build machines for certain types of builds. 
The nickname for these smaller types of build pools is a jacuzzi. 
- Given a smaller pool, there would be a higher chance that the previous artifacts remain the next time the job ran. 
- We use a tool called mock that installs packages in isolated build environments. They also optimized the mock environments so that packages weren’t reinstalled if they already existed. 
- These changes improved build times on these machines by 50%. 
—— 
Reference 
http://atlee.ca/blog/posts/initial-jacuzzi-results.html 
http://hearsum.ca/blog/experiments-with-smaller-pools-of-build-machines/ 
http://rail.merail.ca/posts/firefox-builds-are-way-cheaper-now.html
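A toy sketch of the jacuzzi idea: pin each build type to a small, fixed pool of slaves so incremental builds usually land on a machine that already has the relevant checkout and object directory. The pool contents and builder names are made up; the actual allocation machinery is described in the posts above.

```python
# Toy sketch of the jacuzzi idea: pin each build type to a small, fixed pool
# of slaves so incremental builds usually land on a machine that already has
# the relevant checkout and object directory. Pool contents and builder names
# are made up; the actual allocation machinery is described in the posts above.
JACUZZIS = {
    "linux64-mozilla-central-build": ["bld-linux64-ec2-001",
                                      "bld-linux64-ec2-002",
                                      "bld-linux64-ec2-003"],
}

def eligible_slaves(builder, all_available_slaves):
    """Restrict a builder to its jacuzzi if it has one, otherwise use the full pool."""
    pool = JACUZZIS.get(builder)
    if pool is None:
        return all_available_slaves
    return [s for s in all_available_slaves if s in pool]

available = ["bld-linux64-ec2-002", "bld-linux64-ec2-117", "bld-linux64-ec2-003"]
print(eligible_slaves("linux64-mozilla-central-build", available))
print(eligible_slaves("linux64-try-build", available))
```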
Source: Chris Atlee http://atlee.ca/blog/posts/initial-jacuzzi-results.html
Puppet vs AMIs 
• Originally used Puppet to manage all of our build and test instances 
• It was too slow to puppetize the spot instances 
• Solution: Create golden AMIs from the configs each night. These are used to instantiate the new spot instances. 
note: we still use Puppet to manage our buildbot masters within AWS
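A sketch of the nightly golden-AMI step with boto might look like the following. The instance ID, tag names, and naming scheme are placeholders; the real scripts are in build-cloud-tools.

```python
# Sketch of the nightly "golden AMI" step: take a freshly puppetized
# instance and snapshot it into an AMI that spot instances can boot from
# directly, skipping puppetization at start-up. The instance ID, tag names,
# and naming scheme are placeholders; the real scripts are in build-cloud-tools.
import time
import boto.ec2

def create_golden_ami(instance_id, slave_type, region="us-east-1"):
    conn = boto.ec2.connect_to_region(region)
    name = "%s-golden-%s" % (slave_type, time.strftime("%Y-%m-%d"))
    ami_id = conn.create_image(instance_id, name,
                               description="nightly golden AMI for %s" % slave_type)
    conn.create_tags([ami_id], {"moz-type": slave_type, "moz-golden": "true"})
    print("created %s (%s)" % (name, ami_id))
    return ami_id

# e.g. create_golden_ami("i-0123456789abcdef0", "tst-linux64-spot")
```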
Summary costs 
• Optimize use of regions, instance type and capacity 
• Use spot instances 
• Smarter bidding algorithms 
• Shorter wall time through use of jacuzzis 
• Use instance storage vs EBS to save $ 
• Route over public internet where possible 
• Cache artifacts through the use of the proxxy tool
This chart shows the number of monthly pushes in the last six years. You can see that in 2014 our volume has increased significantly. (Doubled compared 
with the beginning of 2013) For instance, last month we had 12821 pushes (October)
This chart shows our monthly AWS bill since we started migrating machines. You can see how there is quite a dramatic drop off in our monthly AWS costs 
despite our increased load.
This chart shows the dollar per push. This does not include costs for on premise equipment, just AWS. 
! 
Note: Of course, some of these drops in cost are due to Amazon’s reduced prices over the years, not just our optimizations :-)
Selena Deckelman’s architecture diagram https://wiki.mozilla.org/ReleaseEngineering/OverviewArchitectureDiagram 
For context, here is what our entire releng build pipeline looks like 
! 
Selena Deckelman’s architecture diagram https://wiki.mozilla.org/ReleaseEngineering/OverviewArchitectureDiagram
And here are the parts highlighted that now reside in AWS. Linux and Android builds and tests. Buildbot and Puppet masters to support them. There is 
still some work left to do…. 
! 
The red circles indicate some parts that have been migrated, but do not indicate that all of that infrastructure has been migrated for that service. For 
instance, some of our buildbot masters now reside in AWS, but those that support our on premise equipment remain in our data centre.
Questions?
Learn more 
• @MozRelEng 
• http://planet.mozilla.org/releng/ 
• Mozilla Releng wiki https://wiki.mozilla.org/ReleaseEngineering 
• IRC: channel #releng on moznet
Where’s the code? 
• Cloud tools: https://github.com/mozilla/build-cloud-tools 
• buildbot configs https://github.com/mozilla/build-buildbot-configs 
• buildbotcustom https://github.com/mozilla/build-buildbotcustom 
• Mozharness https://github.com/mozilla/build-mozharness 
• Mozpool https://github.com/mozilla/mozpool 
• Puppet configs https://github.com/mozilla/build-puppet
More Reading 1 
• Laura's talks on monitoring complex systems http://vimeo.com/album/3108317/video/110088288 
• Armen’s talk on our hybrid infrastructure https://air.mozilla.org/problems-and-cutting-costs-for-mozillas-hybrid-ec2-in-house-continuous-integration/ 
• Move to AWS starting in 2012 
• http://atlee.ca/blog/posts/blog20121002firefox-builds-in-the-cloud.html 
• http://johnnybuild.blogspot.ca/2012/08/migrating-linux32-and-linux64-builds-to.html 
• http://atlee.ca/blog/posts/blog20121214behind-the-clouds.html 
• http://rail.merail.ca/posts/firefox-unit-tests-on-ubuntu.html 
More Reading 2 
• AWS spot instances vs reserved instances 
• http://atlee.ca/blog/posts/now-using-aws-spot-instances.html 
• http://rail.merail.ca/posts/firefox-builds-are-way-cheaper-now.html 
• http://rail.merail.ca/posts/ec2-spot-instances-experiments.html 
• http://taras.glek.net/blog/2014/05/09/how-amazon-ec2-got-15x-cheaper-in-6-months/ 
• http://taras.glek.net/blog/2014/03/05/more-and-faster-c-i-for-less-on-aws/ 
• AWS networking 
• http://atlee.ca/blog/posts/aws-networks-and-burning-trees.html 
• http://rail.merail.ca/posts/using-dns-to-query-aws.html
More Reading 3 
• Scaling 
• http://atlee.ca/blog/posts/bursty-load.html 
• jacuzzis 
• http://atlee.ca/blog/posts/initial-jacuzzi-results.html 
• http://hearsum.ca/blog/experiments-with-smaller-pools-of-build-machines/ 
• Caching 
• http://atlee.ca/blog/posts/cache-em-all.html
