The value of containers is widely touted, but running them securely at scale and in long-lived production environments presents new challenges. Amazon EC2 Container Service (ECS) changes the game by delivering cluster management and scheduling as a service. In this talk we’ll present how Okta uses ECS for parallelized testing in CI and for production microservices in a multi-region, always-on cloud service. Learn why we chose ECS, along with tips and tricks for securing, scaling, and managing cost.
5. Thousands of Enterprise Customers
Industries: Ed, Gov, Non-Profit; Services; Media; Consumer; Technology; Manufacturing, Energy; Finance; Cloud; Health
6. Okta Application Network
Okta IT & Platform products
• Universal Directory: Extensible Profiles, Attribute Transformations, Directory Integration and AD Password Management
• Single Sign On: Secure SSO for All Your Web Apps, On-prem and Cloud, with Flexible Policy, from Any Device
• Adaptive MFA: Contextual Access Policies, Modern Factors, Adaptive Authentication, Integrations for Apps and VPNs
• Provisioning: Lifecycle Management, Cloud & On-prem App Integration, Mastering from Apps, Directory Provisioning, Rules, Workflow, Reporting
• Mobility Management: Tight User Identity Integration, Device Based Contextual Access, Light-weight Management
7. The most reliable IDaaS available
Never taken offline for upgrades
Redundant and scalable
okta.com/trust
A Platform Architecture For Scale
[Diagram: two data centers, DC1 and DC2, each labeled A, B, C; tiers labeled LOAD BALANCERS, APP SERVERS, DATA TIER]
11. Defining a pattern for micro-services
https://www.pinterest.com/pin/205828645447534387/
http://www.bennysbaker.com/poop-emoji-cupcakes/
12. DevOps abstraction layer
Inspired by: http://dev2ops.org/2010/02/what-is-devops/
Dev: I want change
Ops: I want stability
Wall of turmoil
Domain boundary
13. Repeatability through immutability
• Same runtime environment: dev / test / prod
• Runtime versioned w/ code
• Easy reproducibility
• All changes use same release process
14. Additional requirements
• 0-downtime deployments
• Support for our multi-AZ & multi-region architecture
• Compliance – SOC 2 Type 2, HIPAA, ISO 27001
• Separation of duties – a.k.a. no developer access to production hosts
• Push button deployment
• Rollback and canary support
21. Additional concepts
• Task Definitions define one or more containers to run.
• Services define a long running task and run inside a cluster.
• Clusters define a set of EC2 resources that can be shared by more than one service.
• Auto scaling groups can be used to define size and launch configuration of a cluster.
26. Why ECS - Dynamic worker scaling
1. Lambda: Task which scales the cluster based on the queue
2. Lambda: Inspects running tasks and bin packs new tasks where possible
• This is one of the changes we had to make in order to use ECS for long running tasks, rather than long running services spread across many stateless instances
• Disconnects unneeded nodes from the cluster, allowing them to self-terminate when they are idle
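The bin-packing step described above can be sketched as a best-fit-decreasing placement over the memory still free on each instance. This is a minimal sketch, not the implementation used in the talk: the `Instance` class, the memory-only resource model, and every name below are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Instance:
    """Hypothetical view of one container instance in the cluster."""
    instance_id: str
    free_memory_mib: int
    tasks: List[str] = field(default_factory=list)


def bin_pack(instances: List[Instance], pending: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Place each pending task (name -> memory in MiB) on the fullest instance that fits.

    Returns task name -> instance id, or None when no instance has room,
    which is a signal that the cluster needs to scale out.
    """
    placement: Dict[str, Optional[str]] = {}
    # Largest tasks first, so big tasks are not crowded out by small ones.
    for task, need in sorted(pending.items(), key=lambda kv: kv[1], reverse=True):
        candidates = [i for i in instances if i.free_memory_mib >= need]
        if not candidates:
            placement[task] = None
            continue
        # Best fit: the instance with the least free memory that still fits,
        # which keeps lightly used instances empty so they can be drained.
        target = min(candidates, key=lambda i: i.free_memory_mib)
        target.free_memory_mib -= need
        target.tasks.append(task)
        placement[task] = target.instance_id
    return placement


def idle_instances(instances: List[Instance]) -> List[str]:
    """Instances without tasks are candidates to disconnect and self-terminate."""
    return [i.instance_id for i in instances if not i.tasks]
```

Packing tasks tightly is what leaves whole instances idle; spreading tasks evenly, which suits long running services, would rarely leave any node free to terminate.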
29. Feature Requests
• Ability to have spot and on-demand in same Auto Scaling Group (ASG)
• Built-in bin packing scheduler
• Give ASG a termination policy based on ECS status
• i.e. prefer instances with no running tasks
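The requested termination behaviour can be approximated outside the ASG by choosing the instances to remove yourself. The following is a minimal sketch under stated assumptions: the per-instance task counts and the launch order are passed in as plain data, and the function names are hypothetical.

```python
from typing import Dict, List


def termination_order(running_tasks: Dict[str, int], launch_order: List[str]) -> List[str]:
    """Order instances for scale-in: fewest running tasks first, oldest first on ties.

    running_tasks maps instance id -> number of running ECS tasks.
    launch_order lists instance ids from oldest to newest.
    """
    age_rank = {instance_id: rank for rank, instance_id in enumerate(launch_order)}
    return sorted(running_tasks, key=lambda i: (running_tasks[i], age_rank[i]))


def pick_for_scale_in(running_tasks: Dict[str, int], launch_order: List[str], count: int) -> List[str]:
    """Return the `count` instances that are cheapest to terminate."""
    return termination_order(running_tasks, launch_order)[:count]
```

Idle instances sort to the front, so scaling in by one or two nodes does not interrupt running work when any idle node exists.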
30. Termination policy
• OldestInstance. Auto Scaling terminates the oldest instance in the group. This option is useful when you're upgrading the instances in the Auto Scaling group to a new EC2 instance type, so you can gradually replace instances of the old type with instances of the new type.
• NewestInstance. Auto Scaling terminates the newest instance in the group. This policy is useful when you're testing a new launch configuration but don't want to keep it in production.
• OldestLaunchConfiguration. Auto Scaling terminates instances that have the oldest launch configuration. This policy is useful when you're updating a group and phasing out the instances from a previous configuration.
• ClosestToNextInstanceHour. Auto Scaling terminates instances that are closest to the next billing hour. This policy helps you maximize the use of your instances and manage costs.
• Default. Auto Scaling uses its default termination policy. This policy is useful when you have more than one scaling policy associated with the group.
31. Takeaways
• ECS is running well for us in a 150+ instance cluster
• Bake AMI with large files and common images into host machines
• Spot instances give a 2-minute warning, so keep jobs short
39. Workflow
[Diagram: deployment workflow. Labels: Git Repo (Dockerfile, docker_compose.yml); CI Pipeline; Promoted artifacts; Docker Registry (Artifactory); Application YAML; Conductor (Bastion ECS Controller); Deploy new version; Auto Scaling Group; Launch Config; EC2; ECS Cluster; ECS Service; ECS Canary Service; ELB; Images pulled when tasks start. Environments: Dev; Test / Preview / Production]
40. Application definition
• Developers define YAML for their application
• Deploy time configuration is supplied to the ECS task definition
• Secrets are pulled by the application at startup
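The flow above, from application definition to ECS task definition, can be sketched as a merge of two dictionaries. This is an illustrative sketch only: the application schema (`name`, `image`, `version`, `memory_mib`, `port`) and the secret-name check are assumptions, not the format used in the talk. The output keys follow the ECS task definition structure.

```python
from typing import Any, Dict

# Hypothetical naming convention for spotting secrets passed by mistake.
SECRET_MARKERS = ("SECRET", "PASSWORD", "TOKEN")


def build_task_definition(app: Dict[str, Any], deploy_config: Dict[str, str]) -> Dict[str, Any]:
    """Merge a parsed application definition with deploy-time configuration."""
    for key in deploy_config:
        # Secrets are pulled by the application at startup, so they must
        # never travel through the task definition.
        if any(marker in key.upper() for marker in SECRET_MARKERS):
            raise ValueError(f"{key} looks like a secret; do not pass it as config")
    environment = [
        {"name": key, "value": value}
        for key, value in sorted(deploy_config.items())
    ]
    return {
        "family": app["name"],
        "containerDefinitions": [
            {
                "name": app["name"],
                "image": f'{app["image"]}:{app["version"]}',
                "memory": app["memory_mib"],
                "essential": True,
                "portMappings": [{"containerPort": app["port"]}],
                "environment": environment,
            }
        ],
    }
```

The resulting dictionary has the shape that a task definition registration call expects, so the same application definition can be deployed to each environment with only `deploy_config` changing.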
41. Security conventions
• Container repository
• Only allow containers from internal repository
• IAM separation per service
• Either service per cluster or use new IAM for ECS functionality
• Security scanning of containers - JFrog Xray
• Process monitoring on Docker host – cAdvisor from Google
• Secrets or any form of config NEVER baked in containers
• Start from minimal, audited base OS
• Run container as non-privileged user w/ user namespaces (Docker 1.10+)
• Monitor alas.aws.amazon.com for critical updates
42. Source Conventions
• 3 categories of container definitions
1. “Library” definitions used as the basis for building other images
2. Third-party service definitions e.g. Zookeeper or Elasticsearch
3. Internal service definitions
• Repo per internal service
• Dockerfile in same repo => image versioned with code
• Docker compose for running dependent services
• Pegged versions (no builds)
• Single repo for library and third-party service definitions
43. Build Conventions
• Integration tests run against code running in container
• Build owns creating immutable version and publishing to artifact server
• Strict rules around “FROM” clause
• Must point at internal artifact server
• Must be tagged following SEMVER-SHORT_SHA convention
• Never allow a missing tag or the “latest” tag, so builds stay repeatable
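The “FROM” rules above lend themselves to an automated check in the build. The following is a rough sketch, not the check used in the talk: the registry host `registry.example.internal` is a placeholder, and the tag pattern assumes SEMVER-SHORT_SHA looks like `1.2.3-ab12cd3`, which the slides do not spell out.

```python
import re
from typing import List

INTERNAL_REGISTRY = "registry.example.internal"  # hypothetical internal artifact server
TAG_PATTERN = re.compile(r"^\d+\.\d+\.\d+-[0-9a-f]{7,12}$")  # assumed SEMVER-SHORT_SHA shape


def check_from_lines(dockerfile_text: str) -> List[str]:
    """Return a list of rule violations found in the FROM lines of a Dockerfile."""
    errors: List[str] = []
    for line in dockerfile_text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Skip option flags such as --platform=... to find the image reference.
        refs = [p for p in parts[1:] if not p.startswith("--")]
        if not refs:
            continue
        image = refs[0]
        if not image.startswith(INTERNAL_REGISTRY + "/"):
            errors.append(f"{image}: must point at the internal artifact server")
        # The tag, if any, follows the last ':' in the final path segment,
        # so a registry port such as host:5000/ is not mistaken for a tag.
        last_segment = image.rsplit("/", 1)[-1]
        if ":" not in last_segment:
            errors.append(f"{image}: missing tag")
            continue
        tag = last_segment.split(":", 1)[1]
        if tag == "latest":
            errors.append(f"{image}: 'latest' tag is not allowed")
        elif not TAG_PATTERN.match(tag):
            errors.append(f"{image}: tag does not follow SEMVER-SHORT_SHA")
    return errors
```

A CI step could fail the build whenever the returned list is non-empty. The sketch does not handle references to earlier build stages or `FROM scratch`.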
44. Logging and monitoring
• Logging
• All output streams pipe to STDOUT/STDERR of the running process
• Log forwarding is provided by underlying host
• Log entries contain
• Host
• Container Id
• Image name & version
• Request Id
• Metrics
• Host level, generic container metrics provided by host
• App level metrics published directly to well defined endpoints
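A log entry carrying the fields listed above can be sketched with the standard `logging` module writing JSON to STDOUT. This is a sketch under assumptions: the `DOCKER_HOST_NAME` environment variable is hypothetical (a host-side forwarder could add the host field instead), and the container id relies on Docker's default of using the short container id as the hostname.

```python
import json
import logging
import os
import socket
import sys
from typing import IO


class ContextJsonFormatter(logging.Formatter):
    """Formats each record as one JSON object with deployment context."""

    def __init__(self, image: str) -> None:
        super().__init__()
        self.image = image  # image name & version, e.g. "app:1.2.3-ab12cd3"
        self.host = os.environ.get("DOCKER_HOST_NAME", "unknown")  # hypothetical variable
        self.container_id = socket.gethostname()  # Docker's default hostname

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "host": self.host,
            "container_id": self.container_id,
            "image": self.image,
            # Supplied per call via logger.info(..., extra={"request_id": ...}).
            "request_id": getattr(record, "request_id", None),
        })


def build_logger(name: str, image: str, stream: IO[str] = sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(ContextJsonFormatter(image))
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage would look like `build_logger("app", "okta-sage:1.1.0-ec67fd3").info("handled request", extra={"request_id": "req-42"})`, with the host forwarding whatever the process writes to STDOUT.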
45. Feature requests
• ELB
• Dynamic port mapping to containers
• Fail health based on HTTP return code
• Different health endpoint for adding vs removing
• Service level security groups
• Service discovery w/o ELB
• Ability to mark container instances as un-schedulable
• Remove sharp edges around the stopped state
• Give ASG ability to set EC2 “shutdown behavior”
• Periodic cleanup process in ECS to deregister stopped instances
46. Takeaways
• /etc/ecs/ecs.config
• ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION for forensics (default 1hr)
• ECS_LOGLEVEL=debug
• Beware of running services in same cluster that use the same ports
• Tune ELB health check
• Docker 1.10 for security enhancements
• Canary & Blue/Green: separate service attached to same ELB
• Rollback is trivial
• ECS is incredibly easy to get up and running
• The ecosystem is changing quickly, we are moving cautiously
• ECS team has made a lot of improvements
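The /etc/ecs/ecs.config settings named above are typically written at instance launch. The following is a minimal sketch of generating that file and a user data script for it; the function names, the cluster name, and the chosen values are assumptions, while the three variable names are ECS container agent settings.

```python
def render_ecs_config(cluster: str, cleanup_wait: str = "1h", log_level: str = "info") -> str:
    """Render the contents of /etc/ecs/ecs.config as KEY=value lines."""
    settings = {
        "ECS_CLUSTER": cluster,
        # Keep stopped containers around long enough for forensics.
        "ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION": cleanup_wait,
        "ECS_LOGLEVEL": log_level,
    }
    return "".join(f"{key}={value}\n" for key, value in settings.items())


def render_user_data(cluster: str, cleanup_wait: str = "1h", log_level: str = "info") -> str:
    """Render a shell user data script that appends the settings at boot."""
    config = render_ecs_config(cluster, cleanup_wait, log_level)
    return "#!/bin/bash\ncat <<'EOF' >> /etc/ecs/ecs.config\n" + config + "EOF\n"
```

For example, `render_user_data("ci-workers", cleanup_wait="6h", log_level="debug")` produces a script that could be placed in a launch configuration, where `ci-workers` is a made-up cluster name.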
47. [Diagram labels: Dev, Ops, Wall of turmoil]
Automated pipeline of awesomeness
49. Thank You
Follow me @JonToddDotCom
Join us @Okta - www.okta.com/company/careers/
Editor's notes
How many have heard of Okta?
Used it?
This is the full set of Okta IT and Platform products, 100% cloud-based and integrated. Each of these is a full-featured product you could use to replace CA, RSA, or Airwatch.
Java backend
JS Front end
Entirely hosted in the cloud in AWS
In general we like using and giving back to open source
Same environment dev / test / prod
Environment should be versioned with code
Problems with chef mutating production with bad or incorrect version of config
Easy reproducibility
Security audit can be done on artifact and then just monitor runtime for correct version
All together we get a PATTERN FOR MICROSERVICES
- We run on the ECS-optimized AMI
- Reduced packages
- Upgrade is easier
Developers can run all CI tests on any topic branch
Master locked down, Bacon is the gate keeper
Jenkins used for job definition and lifecycle
Slave pool is ECS!
ECS run as short lived tasks
Each day we get between 100 & 150 containers at peak load
This is bacon
From before: main goal repeatability and immutability
Not only are the artifact and its runtime immutable, but the build that produces the artifact for testing is itself containerized
Solves classic problem: changes to environment in CI
Who has the knowledge about sizing?
We presently respond to Spot price termination notices (you get 2 minutes' warning) by placing the tasks running on the node to be terminated back into the queue, where they are immediately picked up by another node.
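The requeue-on-notice behaviour described above can be sketched as a poll of the EC2 instance metadata endpoint for Spot termination time, which returns an error until the instance is marked for termination. This is a sketch, not the code from the talk: `requeue_if_terminating` and the `enqueue` callback are hypothetical, and instances that require IMDSv2 session tokens would need an extra token request that is omitted here.

```python
import urllib.request
from typing import Callable, Iterable, Optional

TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def fetch_termination_time(url: str = TERMINATION_URL, timeout: float = 1.0) -> Optional[str]:
    """Return the termination timestamp, or None if no notice has been issued."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode().strip()
    except OSError:
        # Covers HTTP errors such as 404 (no notice yet) and connection failures.
        return None


def requeue_if_terminating(
    running_tasks: Iterable[str],
    enqueue: Callable[[str], None],
    fetch: Callable[[], Optional[str]] = fetch_termination_time,
) -> bool:
    """Put this node's tasks back on the queue when a termination notice is seen."""
    if fetch() is None:
        return False
    for task in running_tasks:
        enqueue(task)
    return True
```

Run on a short interval, a loop around `requeue_if_terminating` leaves most of the two-minute window for another node to pick the work up, which is also why short jobs lose less when a node disappears.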
Currently working on recognition of spot price instance pool cascades, so we can switch to on demand.
No ability to have both spot and on demand in same ASG
Something to worry about. If the prices spike, and cause large outages, what is the availability of on demand instances?
We auto scale to around 150 instances and back down to under 20 every day.
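Scaling of this kind can be driven by the depth of the work queue. The following is a minimal sketch of the sizing arithmetic, assuming a fixed number of tasks per instance; the function name, parameters, and the default cap of 150 are illustrative choices, not figures taken from the scaling code itself.

```python
import math


def desired_instances(
    queued_tasks: int,
    running_tasks: int,
    tasks_per_instance: int,
    min_instances: int = 0,
    max_instances: int = 150,
) -> int:
    """Size the cluster to hold all queued and running tasks, within bounds."""
    if tasks_per_instance <= 0:
        raise ValueError("tasks_per_instance must be positive")
    needed = math.ceil((queued_tasks + running_tasks) / tasks_per_instance)
    return max(min_instances, min(max_instances, needed))
```

The result would be applied as the desired capacity of the Auto Scaling group, so an empty queue lets the cluster shrink toward its minimum.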
Preload Maven, NPM, & git repositories. This saved us about 4 minutes on container start time
The integration point between CI and Deployment is Artifactory. Any sign off or approval happens there
Autoscaling groups control pool of EC2 instances
Launch config sets environment variables for ECS config like cluster and ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION
ECS cluster per service due to IAM issues, looking forward to using new feature
One or more services can be registered to a single service ELB, which supports canary
Conductor is a bastion service. It allows non-operators to perform deployments
Software is grouped into applications which may have multiple components
YAML defines all components. In this case we have an application with a single backend running in ECS
2016-07-26T18:56:08Z [INFO] Redundant container state change for task op1-sage:15 arn:aws:ecs:us-east-1:011750033084:task/8f9920cf-a289-44bb-ac43-e436d6fb84d7, Status: (RUNNING->RUNNING) Containers: [op1-sage-app (RUNNING->RUNNING),]: op1-sage-app(docker.aue1d.saasure.com/okta-sage:1_1_0_029796_ec67fd3) (RUNNING->RUNNING) to RUNNING, but already RUNNING