+----------------------+     +---------------------+
|                      |     |                     |
|  Any Branch Of Code  |  +  |  Web Zero Downtime  |
|                      |     |                     |
+--------------------+-+     +-+-------------------+
                     |         |
                     |         |
                +----v---------v----+
                |                   |
                |  Docker Container |
                |                   |
                +--+------+------+--+
                   |      |      |
+---------------+  |      |      |  +-----------------+
|               |<-+      |      +->|                 |
|   Test Data   |         |         | Production Data |
|   [devenv]    |         v         |                 |
+---------------+  +--------------+ +-----------------+
                   |              |
                   | Staging Data |
                   |              |
                   +--------------+
PaaS
CI/CD
Self-service configuration
Monitoring
Any OS
Python
Ruby
Logging
Security
Test everything
WIN! All The Things!
Presenter notes
To help visualize what we’ve done, here is some great ASCII art. Talk through the diagram.
We put all our code in a git repo so both teams could contribute to the environment. We made it possible to wipe the entire environment without involving IT by containing it all within a virtual machine; for that we used VirtualBox and Vagrant. By using Docker we instantly had golden images of each service within our infrastructure, which made sure it worked perfectly every time. And to boot, it was all orchestrated by running a single shell command.
After 24 hours we had replaced our entire dev environment and had several developers already up and running with it. To cap it off, we won the award at our hack day for “most ready for production”.
The dev environment is still used today. Over the past month at RelateIQ we’ve upgraded to the latest major versions of Cassandra (2.0) and Elasticsearch (1.0). The Ops team is able to make these updates in the git repo and just shoot a message to the developers to run devenv update. Across our entire development team we are able to upgrade to the latest versions almost instantly. Not to mention, if we forget a setting, developers can make the change themselves. They can create their beloved tests and database schema changes whenever they want. We didn’t know it at the time, but Docker started a transformation at RelateIQ: Dev and Ops were moving closer together.
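The internals of the devenv tool aren’t shown in the talk, so this is only a hypothetical sketch of what a “devenv update” flow like the one described could look like; the repo path and the Vagrant provisioning step are assumptions:

```shell
# Hypothetical sketch of a "devenv update" flow: pull the Ops team's latest
# container definitions, then re-provision the VM that holds the containers.
devenv_update() {
  repo_dir=${1:-"$HOME/devenv"}           # git repo both teams contribute to (path assumed)
  git -C "$repo_dir" pull --ff-only       # pick up the latest environment definitions
  (cd "$repo_dir" && vagrant provision)   # rebuild the containers inside the VM
}
```

With this shape, a developer runs one command and gets the Ops team’s latest Cassandra or Elasticsearch version without any manual setup.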
SCOTT
Yeah, that hack day lives on in RelateIQ lore and has turned an 18-page environment setup document into just a few pages. In all earnestness, there are these moments in computing history that stick out for me:
The first time I spun up NCSA Mosaic over a PPP connection and browsed the nascent web on Windows 3.11
The first time I played Doom, jaw on the floor, and soon thereafter, with my Dad’s two computers connected via 10BASE2, terminators and all, playing a game of deathmatch and shooting my friends virtually for the first time.
And now, that first time I ran Docker, creating ready-to-run containers in milliseconds, iterating on Dockerfiles with snapshots after each successful run. True DevOps bliss.
Contrasting Chef… yeesh.
SCOTT (continued)
So, once we had a great dev environment and had seen the power of Docker, I started dreaming about production uses, pre-1.0 be damned. At that time our web app deployment method was effectively: rsync the files over and restart the services. Sure, we had redundant stateless servers behind a load balancer, but we didn’t have the necessary orchestration or patience to develop something that would require serially upgrading our servers. We started getting tweets from international clients about prod pushes happening at 2 AM Pacific. Five minutes of downtime was enough to convince us we should scope out a solution.
We are a startup, so we have a lot of things to build and not enough people to build them (cough, we’re hiring, cough), and I had been having this recurring dream of a Docker-based method of sneaking zero downtime nearly seamlessly into our existing servers. Armed with Docker, could another engineer and I do it in a week?
Enter “Project Zero Downtime”
First, we had to enhance our build agents to be able to run Docker. Anyone want to guess how we did that? Docker, of course. So my compadre Jon Gretarrson built a TeamCity Docker container, a TeamCity agent Docker container, and a Docker-in-Docker container. Our agents (running in Docker) could all talk to the Docker container (running in Docker) in order to build our webapp Docker containers.
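A minimal sketch of that wiring, along the lines of the era’s jpetazzo/dind approach; the image names, port, and link alias here are illustrative assumptions, not RelateIQ’s actual configuration:

```shell
# Sketch: a nested Docker daemon that CI agents can build against.

start_build_docker() {
  # Docker-in-Docker needs --privileged to run its own daemon
  docker run -d --privileged --name build-docker jpetazzo/dind
}

start_agent() {
  # the agent container reaches the nested daemon through a link,
  # pointing its docker client at it via DOCKER_HOST
  docker run -d --link build-docker:docker \
    -e DOCKER_HOST=tcp://docker:2375 \
    teamcity-agent
}
```

The agent itself runs in a container, but the images it builds land in the nested daemon, keeping build state isolated from the host.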
It sounds complicated, but it was up and running and pushing the existing prod build configurations within two days. We planned at first to use a private Docker repository to push and pull these webapp containers to the prod servers, but… yeah… that particular service wasn’t ready yet. We settled, temporarily at least, on using docker save to package up the containers and distribute them to the servers.
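The docker save fallback can be sketched in a few lines; the hostnames, tag, and tarball path below are placeholders:

```shell
# Package an image (with all its layers) as a tarball, copy it to each
# prod server, and load it into the remote Docker daemon.
ship_container() {
  image=$1; shift                                  # e.g. webapp:build-123, then target hosts
  docker save "$image" -o /tmp/webapp.tar          # image + layers as one tarball
  for host in "$@"; do
    scp /tmp/webapp.tar "$host:/tmp/webapp.tar"    # distribute the tarball
    ssh "$host" "docker load -i /tmp/webapp.tar"   # unpack into the remote daemon
  done
}
```

Usage would look like `ship_container webapp:build-123 web1 web2 web3` (hosts hypothetical). It’s slower than a registry push, but it needs nothing beyond ssh.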
Before this Docker work, our integration testing used a combination of sketchy “embedded” versions of our various database dependencies and a parallel infrastructure for testing. This was slow, flaky, and inaccurate. Once we had Docker accessible from our CI agents, we easily swapped our tests to use containers spun up specifically for that test run: containers that perfectly matched our production environment and versions.
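A sketch of that per-run pattern, with illustrative image names (the real tags would be pinned to match production) and the test command passed through as arguments:

```shell
# Spin up throwaway dependency containers for one test run, then tear them
# down. -P publishes exposed ports on random host ports, so parallel CI
# runs don't collide.
with_test_deps() {
  cass=$(docker run -d -P cassandra:2.0)              # pinned to the prod version
  es=$(docker run -d -P elasticsearch:1.0)
  trap 'docker rm -f "$cass" "$es" >/dev/null' EXIT   # clean up even if tests fail
  "$@"                                                # run the test command with deps up
}
```

Usage: `with_test_deps ./run-integration-tests.sh` (script name hypothetical).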
Meanwhile, I was tasked with building the Dockerfile for our webapp, and a simple bash script to orchestrate a single node so it could be upgraded with zero downtime.
The design we settled on, while not necessarily perfect, was perfect for the time and place.
Here’s how it works:
First, so we didn’t need to orchestrate between multiple virtual machines or spin up any new machines, we wanted something that was zero downtime even if we had only one node. We also liked the idea of being able to roll out the new version in parallel across all our web nodes, so the window of mismatched front-end/back-end versions was as short as possible. To do this we needed something that would keep running while we replaced the current container version. We chose Hipache (another dotClou… err, Docker product) for its simple live configurability.
The plan was to have hipache, running in a container of course, on each web node.
Then, we’d spin up the latest version of the webapp container, wait until it was healthy, and use hipache to reroute traffic to it.
Finally, once things were copacetic we’d tell hipache to remove the old container, and kill it off.
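The three steps above can be sketched as a single swap function. This is a minimal sketch, not the actual script: it assumes Hipache’s standard setup of reading its routes live from a Redis list named frontend:&lt;domain&gt; (element 0 is an identifier, the rest are backend URLs), and the domain, image name, app port, and /health URL are all placeholders:

```shell
FRONTEND="frontend:app.example.com"   # hipache's route key in redis (placeholder)

swap() {
  old_ctr=$1   # currently-serving container name/id
  new_tag=$2   # webapp image tag to roll out
  old_url=$(redis-cli lindex "$FRONTEND" 1)                   # current backend URL
  new_ctr=$(docker run -d -P "webapp:$new_tag")               # new version, random host port
  port=$(docker port "$new_ctr" 8080 | head -n1 | cut -d: -f2)
  new_url="http://127.0.0.1:$port"
  # 1. wait until the new container answers its health check
  until curl -fs "$new_url/health" >/dev/null; do sleep 1; done
  # 2. tell hipache (via redis) to route to the new backend, then drop the old one
  redis-cli rpush "$FRONTEND" "$new_url"
  redis-cli lrem "$FRONTEND" 0 "$old_url"
  # 3. give in-flight requests a moment to drain, then kill the old container
  sleep 5
  docker rm -f "$old_ctr"
}
```

Because Hipache re-reads the Redis list on every request, the cutover is instant and requires no proxy restart, which is exactly the live configurability the design needed.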
Now, I wouldn’t necessarily copy this method wholesale; for example, you clearly need some spare memory to temporarily run two webapp containers on the same node. The point, however, is that it worked for us, and it took one week to build something that made us heroes to Engineering, Product, Marketing, and Sales, and has given us months of runway on building a more permanent solution.
Jon (not John; it’s confusing) ended up later spending a day or two bolting on a “ZERO-ROLLBACK” feature that used the containers already present on the web nodes to roll back to a previously working release quickly, easily, and with zero downtime.
Zero-downtime deployments are one of those things that should be a requirement for any SaaS engineering team. The freedom to make quick, silent mid-day pushes lifts a huge weight off everyone’s shoulders.
We’ve had absolutely no Docker-related downtime in the eight months it has been part of our production infrastructure. In fact, this particular project has significantly increased our uptime.
The lines between Dev and Ops were completely blurred, and Docker really brought us together on some great projects. We built several amazing things with Docker over the past year. For our most recent project we got a really crazy idea: what if we combined the automated infrastructure of our dev environment with a modified version of our zero-downtime project, baking our web infrastructure and code together?
The reasons Docker is so important for accomplishing this:
* didn’t have to worry about dependency collisions
* if you can run a container you can run it anywhere, thus if you have containers for all your dependencies, you can run your whole stack anywhere
* isolated networking means you don’t have to figure out how to randomize ports for every single dependency (and the client consuming them!)
* isolated OS so you can run whatever version works best for a given piece of your stack
* super-fast iteration on creating the actual containers for different pieces of environment
* containers start up so quickly that you can spin these up and not worry about slowing a critical path
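The randomized-ports point above can be shown in a few lines; the image name is just an example:

```shell
# Two copies of the same dependency, no port juggling: -P maps each
# container's exposed ports onto random free host ports.
show_port_isolation() {
  a=$(docker run -d -P redis)
  b=$(docker run -d -P redis)
  docker port "$a" 6379       # e.g. 0.0.0.0:49153
  docker port "$b" 6379       # a different host port; nothing to coordinate
  docker rm -f "$a" "$b" >/dev/null
}
```

Inside each container the service still binds its normal port, so neither the dependency nor its client needs any special configuration.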
Enter the new project “Ubiquitous deployment”
The project’s goal was to take any branch of web code, not just our production code off master; build it in TeamCity anytime new code was pushed to a dockerme branch; and start up all the backend infrastructure on the same machine. By using Docker’s linking functionality we were able to join the zero-downtime web deployment to the dev environment, all with continuous deployment. But we didn’t stop there: the app.properties settings were made adjustable so each piece of infrastructure could be redirected to a new location.
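A sketch of how that joining could look with links; the container and image names are placeholders, and the point is that the webapp reaches each dependency through its link alias rather than a hard-coded address:

```shell
# Sketch: start backend containers, then link the webapp to them so
# app.properties can point at the link aliases instead of fixed hosts.
start_stack() {
  docker run -d --name cass cassandra:2.0
  docker run -d --name es elasticsearch:1.0
  docker run -d -P \
    --link cass:cassandra \
    --link es:elasticsearch \
    webapp:dockerme             # image built by CI from the dockerme branch
}
```

Swapping where a dependency lives is then just a matter of changing what the alias resolves to, or overriding the corresponding app.properties entry.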
We were able to take any branch of code, point it to any data set, and deploy at any time, anywhere. Well, not really anywhere; it all runs on the same server. I just needed to make the slide symmetrical.
Here are a couple of examples of how we’ve used this deployment.
First: we all know every company does third-party penetration tests, right? Well, we do. Using the ubiquitous deployment project, we were able to unleash the hackers against an isolated web environment pointing at production data. Our third-party pentesters could hack away at our app without us worrying about affecting our production customers. No need to worry about those pesky DDoS attacks running during the day.
The best example: a couple of months ago we completely redesigned the front end of our application.
All new CSS, modals, buttons, colors, and positioning. How were we going to get our Product, Web, Quality, and Marketing teams to work together on this project without being able to see the results every day? The answer was ubiquitous deployment. We set up a new server running the redesign branch against staging data, which let every team hit a URL and see the latest and greatest code and design. Product was able to sync with the web team, and the quality team was able to test all the new changes.

When we were ready, we redirected the same branch of code to our production servers. However, we quickly realized the new code was not ready, and that realization saved us a ton of time: we had found so many small issues that the launch of the new design would have been a flop. We pushed the new code to production two weeks later with an outstanding release. It was our biggest code merge, and one of the highest-quality pushes since we launched our site. A major success.
expand this section
Docker has transformed how technology at RelateIQ works. The days of Dev and Ops at war are long over; now we work together building the latest and greatest projects, with Docker loved by Dev and Ops alike.
Our goal by the end of the year is for 75% of our infrastructure to be running in Docker containers.
Thank you.
Any questions for John or myself?
Our emails are our first names @relateiq.com… I’ll spell that for you: J-O-B-S@relateiq.com, and mine is J-O-B-S@relateiq.com.