This document describes SPN's journey to implement CI/CD on AWS. It begins by describing SPN's original process for delivering services, which involved many manual steps. It then compares that process against the DevOps goals of faster delivery, lower failure rates, and faster recovery. The document outlines how AWS services such as CloudFormation, OpsWorks, and Auto Scaling are used to implement CI/CD and automate the deployment of a sample service, Analytic Engine. Lessons learned include automating as much as possible, splitting CloudFormation templates, planning for updates that do not impact SLAs, and emphasizing monitoring and testing.
SPN CI/CD journey on AWS: DevOps goals, CI/CD implementation and lessons learned
1. SPN CI/CD journey on AWS
SPN Infra., CoreTech
Scott Miao
11/22/2017
2. Who am I
• Scott Miao
• RD, SPN Infra., TrendMicro
• OOAD system dev. 10+ years
• Hadoop ecosystem 6 years
• AWS for BigData 4 years
• @linkedIn
• @slideshare
3. Agenda
• Original services delivery process in SPN
• Dev/Ops
– DevOps goals V.S. our original way
• CI/CD on AWS
• An example service CI/CD on AWS
• DevOps goals V.S. our original way V.S. CI/CD on AWS
• Lessons learned
8. DevOps
“DevOps is not a new technology or a product. It’s an approach or culture of software development that seeks stability and performance at the same time that it speeds software deliveries to the business.”
── Andi Mann, CA Technology ──
Cited from: Derek Chen, RD, TrendMicro
https://www.slideshare.net/derekhound/devops-in-practice-78905911, p#15
9. Software Delivery
[Figure: software delivery stages (Plan, Code, Build, Test, Release, Deploy, Operate, Monitor), with the spans of Agile Development, Continuous Integration, Continuous Delivery, Continuous Deployment, and DevOps marked against them.]
Cited from: Derek Chen, RD, TrendMicro
https://www.slideshare.net/derekhound/devops-in-practice-78905911, p#23
10. DevOps goals V.S. our original way
• Faster time to market
– So complicated that steps get missed
– Service team needs to follow up themselves
– Steps that need lead time (machine resources, etc)
• Lower failure rate of new releases
– Manual steps lead to errors
• Shorten lead time between fixes
– Rolling upgrade
– Invasive
• Faster mean time to recovery
– Hard to deal with machine errors and peak loads
https://en.wikipedia.org/wiki/DevOps#Goals
11. “Very often, automation supports this objective”
https://en.wikipedia.org/wiki/DevOps#Goals
Quoted from Wikipedia for DevOps goals
12. CI/CD on AWS
THE TWO ACHIEVE THE SAME DEVOPS GOALS
DEVOPS FOCUSES ON ORGANIZATIONAL CHANGES
CI/CD FOCUSES ON TECHNICAL IMPLEMENTATIONS
13. Review for CI and CD
• Continuous Integration
– is the practice of merging all developer working
copies to a shared mainline (trunk) several times
a day
• Continuous Delivery
– produce software in short cycles, ensuring that
the software can be reliably released at any time
• Continuous Deployment
– means that every change is automatically
deployed to production
https://en.wikipedia.org/wiki/Continuous_integration
https://en.wikipedia.org/wiki/Continuous_delivery
14. Characteristics of Cloud Computing
• On-demand self-service
– A consumer can unilaterally provision computing capabilities
• Broad network access
– Capabilities are available over the network and accessed
through standard mechanisms
• Resource pooling
– The provider's computing resources are pooled to serve
multiple consumers using a multi-tenant model
• Rapid elasticity
– Capabilities can be elastically provisioned and released
• Measured service
– Cloud systems automatically control and optimize resource use
http://www.inforisktoday.com/5-essential-characteristics-cloud-computing-a-4189
https://en.wikipedia.org/wiki/Infrastructure_as_Code
16. AWS managed services SPN used
• AWS CloudFormation
– Gives developers and systems administrators an easy
way to create and manage a collection of related
AWS resources
– We use it to provision our service components
• Such as Load balancer (ALB), machines (EC2)
• AWS OpsWorks
– A configuration management service that uses Chef,
an automation platform that treats server
configurations as code
– We use it to deploy, configure and start up our service components
https://aws.amazon.com/cloudformation/
https://aws.amazon.com/opsworks/
17. AWS CloudFormation + OpsWorks
[Figure: a user, AWS S3 (CF templates, artifacts, Chef recipes), AWS CloudFormation (a main stack with IAM, ELB/ALB, and OpsWorks stacks), AWS OpsWorks, and the AWS VPC, connected by the steps below. Swimlanes: User, CF, Ops.]
1. Put CF templates
2. Put artifacts
3. Put Chef recipes
4. Create CF W/ params, VPC ID, etc
5. Templates input
6. Create CF stacks
7. Provision AWS resources
8. Create OpsWorks
9. Artifacts/recipes input
10. Deploy/Config/startup service
Ready to serve
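As a minimal sketch of step 4 (creating the CF stack with parameters such as the VPC ID), the helper below only assembles the keyword arguments that boto3's CloudFormation `create_stack` call accepts. It makes no AWS call itself. The bucket name, template key, and parameter names are hypothetical and are not SPN's actual values.

```python
def build_create_stack_request(stack_name, template_url, params):
    """Assemble keyword arguments for CloudFormation's CreateStack API.

    Pure data construction: nothing is sent to AWS here.
    """
    return {
        "StackName": stack_name,
        # The main template is read from S3 (steps 1 and 5 on the slide).
        "TemplateURL": template_url,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in sorted(params.items())
        ],
        # Required when a template creates IAM resources, as the IAM
        # stack on the slide suggests.
        "Capabilities": ["CAPABILITY_IAM"],
    }


# Hypothetical values for illustration only.
request = build_create_stack_request(
    stack_name="ae-dev-myAE",
    template_url="https://s3.amazonaws.com/example-bucket/cf/main.template",
    params={"VpcId": "vpc-12345678", "ArtifactVersion": "1.19.AE_100"},
)
```

With boto3 installed and credentials configured, the request could be submitted as `boto3.client("cloudformation", region_name="us-east-1").create_stack(**request)`.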
18. CoreTech DCS managed services
• Enterprise github
– Just like the github we use on the Internet
• CloudCI – Enterprise Circle CI
– A Docker container based CI solution
– Seamlessly integrated with github
• JFrog Artifactory
– A CoreTech-wide shared artifacts repo.
20. Analytic Engine is an API service for…
Common Big Data computation
service on Cloud (AWS)
https://www.slideshare.net/takeshi_miao/analytic-engine-a-common-big-data-computation-service-on-the-aws
21. AE High Level Architecture Design
[Figure: AE architecture in Oregon (us-west-2). Labels: AE API servers, workers, and Eureka in multi-AZ Auto Scaling groups (AZa, AZb, AZc); Private ALB; RDS; EMR; cross-account S3 buckets; Amazon SNS; services in the IDC reached over VPN and peering (HTTPS, HTTPS/HTTP Basic); Cloud Storage (isValidUser, CS output); Splunk.]
22. This is really what we are taking care of
[Figure: the AE high-level architecture diagram from slide 21, repeated.]
23. Which components are in CI/CD scope
• In scope
– API, Worker, Eureka, Genie W/ auto-scaling group
• EC2, deploy, configure and start up component services
– AWS Application Load Balancer (ALB)
– AWS Simple Notification Service
• NOT in scope
– VPC/subnets/VPC peerings
• We use fixed VPC and subnets for both VPN connections and VPC peerings
– RDS MySQL DB
• Already pre-created
– EMR clusters
• Created by user API calls via AWS Java SDK
24. CI/CD Usecases
1. Developer edits/pushes code to github
2. Developer deploys AE to Dev env. for tests
3. Developer terminates AE in Dev env. after tests
4. Developer deploys AE to Stg env. for integrated
tests/UAT
5. Developer deploys AE to PROD env.
6. Developer patches hotfixes and deploys to
PROD
7. Monitor your service components
25. 1. Developer edits/pushes code to github
Feature branch workflow: https://www.atlassian.com/git/tutorials/comparing-workflows
Repo: spn/ae-saas (branches: master, AE-100; version label: 1.19.0); CI project: spn/ae-saas
1. Push AE-100 branch
2. Trigger CI
3. build
4. utests
5. package
6-7. cp artifacts to S3 (S3: dev-us-east-1 holds the CF templates, ae-1.19.AE_100.jars, and Chef recipes)
8-9. Publish artifacts to mvn repo.
Every commit will trigger this build
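The artifact names on the slides suggest a versioning rule, sketched below as a guess rather than the actual build script: a feature branch such as AE-100 yields 1.19.AE_100, while a merged build uses 1.19.<buildNum> (for example 1.19.563). The function names and the exact rule are assumptions inferred from those examples.

```python
def artifact_version(base_version, branch, build_num=None):
    """Derive an artifact version string.

    Assumed rule, inferred from the slide examples:
    - feature branch "AE-100" on base "1.19" -> "1.19.AE_100"
    - master with CI build number 563        -> "1.19.563"
    """
    if branch == "master":
        if build_num is None:
            raise ValueError("master builds need a CI build number")
        return f"{base_version}.{build_num}"
    # Hyphens are replaced so the version stays a single token.
    return f"{base_version}.{branch.replace('-', '_')}"


def artifact_name(version):
    """Jar file name in the style shown on the slide."""
    return f"ae-{version}.jar"
```

For example, `artifact_name(artifact_version("1.19", "AE-100"))` gives the feature-branch jar name used in the Dev environment.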
26. 2. Developer deploys AE to Dev env. for tests
Feature branch workflow. Repo: spn/ae-saas (branches: master, AE-100); CI project: spn/ae-saas, W/ env. variables in CI
S3: dev-us-east-1 (CF templates, ae-1.19.AE_100.jars, Chef recipes)
1. Git tag: c-1.19.AE_100-dev-us-east-1-myAE
2. Push tag
3. Trigger CI
4. Create CF
5. CF creating for stack: ae-dev-myAE (in Dev VPC)
5.1 Templates input
6. Provision resources
7. Deploy/config/startup service
Ready for tests
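The deployment tags in these use cases appear to follow the pattern `<action>-<version>-<env>-<region>-<name>`. The sketch below parses such a tag in CI and derives the stack name. The grammar, the reading of the prefixes `c`, `d`, and `u` as create, delete, and update, and the function name are inferred from the example tags and slide titles, not taken from SPN's CI scripts.

```python
import re

# Assumed tag grammar, inferred from the example tags on the slides:
#   <action>-<version>-<env>-<region>-<name>
TAG_PATTERN = re.compile(
    r"^(?P<action>[cdu])"
    r"-(?P<version>[^-]+)"
    r"-(?P<env>dev|stg|prod)"
    r"-(?P<region>[a-z]{2}-[a-z]+-\d+)"
    r"-(?P<name>.+)$"
)

ACTIONS = {"c": "create", "d": "delete", "u": "update"}


def parse_deploy_tag(tag):
    """Split a deployment tag into its parts and derive the stack name."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a deployment tag: {tag!r}")
    parts = match.groupdict()
    parts["action"] = ACTIONS[parts["action"]]
    # Stack naming follows the slide example "ae-dev-myAE".
    parts["stack"] = f"ae-{parts['env']}-{parts['name']}"
    return parts
```

Encoding the target environment and region in the tag keeps the CI job generic: the same job can serve Dev, Stg, and PROD, and only the pushed tag changes.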
27. 3. Developer terminates AE in Dev env. after tests
Feature branch workflow. Repo: spn/ae-saas (branch: master); CI project: spn/ae-saas
1. Git tag: d-1.19.AE_100-dev-us-east-1-myAE
2. Push tag
3. Trigger CI
4. Delete CF
5. CF deleting for stack: ae-dev-myAE (in Dev VPC)
6. Terminating resources
28. 4. Developer deploys AE to Stg env. for integrated tests/UAT (Much like UC#2)
Feature branch workflow. Repo: spn/ae-saas (branches: master, AE-100; version 1.19.563); CI project: spn/ae-saas, W/ env. variables in CI
1. Merge feature branch: 1.19.<buildNum>
2. Git tag: c-1.19.563-stg-us-east-1-myAE
3. Push tag
4. Trigger CI
5. cp artifacts to stg S3
6. cp artifacts from dev to stg (S3: dev-us-east-1 to S3: stg-us-east-1; CF templates, ae-1.19.563.jars, Chef recipes)
6.1 copying
7. Create CF
8. Provision resources for stack: ae-stg-myAE (in Dev VPC)
8.1 Deploy/config/startup service
9. Run itests on service
Ready for tests
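The promotion step (copying artifacts from the dev location to the stg location) can be sketched as a pure planning function that lists which S3 prefixes would be copied. The S3 labels from the slide are used here as if they were bucket names, and the key prefixes are hypothetical. The actual copy (for example with the AWS CLI or an SDK) is left out.

```python
def promotion_plan(version, source_bucket, target_bucket, prefixes):
    """List (source, target) S3 URI pairs for promoting one release."""
    return [
        (
            f"s3://{source_bucket}/{prefix}/{version}/",
            f"s3://{target_bucket}/{prefix}/{version}/",
        )
        for prefix in prefixes
    ]


# S3 labels are from the slide; the key prefixes are hypothetical.
plan = promotion_plan(
    version="1.19.563",
    source_bucket="dev-us-east-1",
    target_bucket="stg-us-east-1",
    prefixes=["cf-templates", "artifacts", "chef-recipes"],
)
```

Promoting the exact files that were tested, instead of rebuilding them per environment, is what lets the Stg and PROD deployments reuse the same version number.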
29. 5. Developer deploys AE to PROD env. (Much like UC#4)
Git tag: c-1.19.563-prod-us-west-2-myAE
30. 6. Developer patches hotfixes and deploys to PROD (1/2)
Feature branch workflow. Repo: spn/ae-saas (branches: master, AE-105; version 1.19.570); CI project: spn/ae-saas, W/ env. variables in CI
1. Git tag: u-1.19.570-prod-us-west-2-myAE
2. Push tag
3. Trigger CI
4. cp artifacts to prod S3
5. cp artifacts from stg to prod (S3: stg-us-east-1 to S3: prod-us-west-2; CF templates, ae-1.19.563.jars, Chef recipes)
5.1 copying
6. Update CF
7. Update CF stack: ae-prod-myAE (in Dev VPC)
8.1 Re-Deploy/config/startup service
Ready to serve
31. 6. Developer patches hotfixes and deploys to PROD (2/2)
• Updating W/O SLA impact
– ALB W/ the UpdatePolicy attribute configured for AutoScalingReplacingUpdate
• Better and more flexible auto-scaling
– EC2 Auto-scaling group + Opsworks (Auto-healing really sucks)
• Cross region deployment as early as possible
– Minor configuration diffs
• A successful deployment to us-east-1 does not guarantee success in other regions…
– AWS SDK default value is us-east-1
• You may forget to set it in your code…
http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html
https://aws.amazon.com/tw/blogs/devops/auto-scaling-aws-opsworks-instances/
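To illustrate the two points above, the sketch below builds an Auto Scaling group resource with the `AutoScalingReplacingUpdate` update policy as a plain Python dictionary, plus a small guard that forces the region to be set explicitly. The logical IDs, subnet IDs, and sizes are hypothetical. Real templates usually pair this policy with a CreationPolicy, which is omitted here.

```python
import json


def asg_with_replacing_update(launch_config_ref, subnets, min_size, max_size):
    """Build an Auto Scaling group resource that is replaced on update."""
    return {
        "Type": "AWS::AutoScaling::AutoScalingGroup",
        "Properties": {
            "LaunchConfigurationName": {"Ref": launch_config_ref},
            "VPCZoneIdentifier": subnets,
            "MinSize": str(min_size),
            "MaxSize": str(max_size),
        },
        # With WillReplace set, CloudFormation creates a new group and
        # keeps the old one until the new one has been created.
        "UpdatePolicy": {"AutoScalingReplacingUpdate": {"WillReplace": True}},
    }


def require_region(region):
    """Fail fast instead of silently falling back to an SDK default."""
    if not region:
        raise ValueError("AWS region must be set explicitly")
    return region


# Hypothetical logical IDs and subnet IDs, for illustration only.
group = asg_with_replacing_update(
    "ApiLaunchConfig", ["subnet-aaaa1111", "subnet-bbbb2222"], 2, 4
)
fragment = json.dumps({"Resources": {"ApiGroup": group}}, indent=2)
```

Because the old group keeps serving behind the ALB until the replacement exists, an update of this kind can avoid a gap in service.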
32. 7. Monitor your service components (1/2)
These are the practices we learned from other teams in Trend
• Visibility
– Operators can get timely system status anytime, anywhere
– Practice:
• CW metrics -> CW dashboard
• CloudWatchLog -> AWS Lambda -> Log management system
• Monitoring
– Operators can set a threshold on any metric as a monitor
– The monitor can then trigger corresponding actions to notify operators
– Practice:
• [App logs -> CW agent -> | custom] CW metrics -> CW Alarm
• Auto-Recovery
– The system can automatically recover any component that fails
– Practice:
• EC2 auto-scaling group + Opsworks
• CW metrics -> CW Alarm -> AWS Lambda -> AWS SDK -> AWS Opsworks|AWS EC2
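The "CW metrics -> CW Alarm" practice can be sketched as the set of arguments passed to CloudWatch's PutMetricAlarm API (boto3's `put_metric_alarm`). The helper below only builds that argument dictionary. The alarm name, namespace, metric name, threshold, evaluation settings, and SNS topic ARN are hypothetical examples.

```python
def build_alarm_request(name, namespace, metric, threshold, sns_topic_arn):
    """Assemble keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Statistic": "Average",
        "Period": 300,           # seconds per evaluation window
        "EvaluationPeriods": 2,  # consecutive breaching windows
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        # The SNS topic can fan out to a pager or a recovery Lambda.
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical values for illustration only.
alarm = build_alarm_request(
    name="ae-api-memory-high",
    namespace="AE/Custom",
    metric="MemoryUtilization",
    threshold=80,
    sns_topic_arn="arn:aws:sns:us-west-2:123456789012:ae-alerts",
)
```

With boto3 available, the alarm could be created with `boto3.client("cloudwatch", region_name="us-west-2").put_metric_alarm(**alarm)`.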
33. 7. Monitor your service components (2/2)
A high level architecture design
[Figure: monitoring pipeline in four stages (Input, Store, Process, Output). Labels: App components and Managed Services; AWS CloudWatch with default metrics and custom metrics (CPU, mem, disk); AWS CloudWatchLog with app logs to CWLog and metric filters; CW metrics, CW Alarms, AWS SNS, AWS Lambda; CW Dashboard, Pager, Log management. Paths are marked Visibility, Monitoring, and Auto-recovery.]
34. DevOps goals V.S. our original way V.S. CI/CD on AWS
• Faster time to market
– Original way: so complicated that steps get missed; service team needs to follow up themselves; steps that need lead time (machine resources, etc)
– CI/CD: one click delivery; only one role, “developer”; minutes of lead time for resources
• Lower failure rate of new releases
– Original way: manual steps lead to errors
– CI/CD: full automation
• Shorten lead time between fixes
– Original way: rolling upgrade; invasive
– CI/CD: replacing/rolling upgrade deployment; non-invasive
• Faster mean time to recovery
– Original way: hard to deal with machine errors and peak loads
– CI/CD: elasticity brought by the Cloud Computing platform
https://en.wikipedia.org/wiki/DevOps#Goals
35. Lessons learned
• Automate everything you can
– Cloudformation + EC2 Auto-scaling group + Opsworks
– AWS::CloudFormation::CustomResource is also a tool to the rescue
• Consider splitting your service CF template
– Service infra. (RDS, SNS, KMS key, etc)
• You do not update your infra. often
– Service instance (EC2, etc)
• We update our service instances very often
• Do not consider only first-time creation
– How to update your services W/O impacting SLA
• Monitor ! Monitor !! Monitor !!!
• TEST ! TEST !! TEST !!!
http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-cfn-customresource.html
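One way to picture the template split is as a partition of resources by type: rarely changed infrastructure in one stack, frequently updated service instances in another. The sketch below assumes that split rule from the slide's examples. The set of "infra" types and the sample resources are illustrative only.

```python
# Assumed split, following the slide's examples (RDS, SNS, KMS key).
INFRA_TYPES = {"AWS::RDS::DBInstance", "AWS::SNS::Topic", "AWS::KMS::Key"}


def split_resources(resources):
    """Partition template resources into (infra, instance) dictionaries."""
    infra, instance = {}, {}
    for logical_id, resource in resources.items():
        target = infra if resource["Type"] in INFRA_TYPES else instance
        target[logical_id] = resource
    return infra, instance


# Hypothetical resources for illustration only.
infra, instance = split_resources({
    "AlertTopic": {"Type": "AWS::SNS::Topic"},
    "DataKey": {"Type": "AWS::KMS::Key"},
    "ApiGroup": {"Type": "AWS::AutoScaling::AutoScalingGroup"},
})
```

Keeping the two sets in separate stacks means a frequent service update never touches the stack that owns long-lived resources.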
39. Different types of Auto-scaling group
Columns: Service, Auto Scaling Group, Features, Deploy
• OpsWorks, 24/7 (Deploy: chef recipe)
– manual creation/deletion
– configure one instance for one AZ
• OpsWorks, time-based (Deploy: chef recipe)
– can specify time slot(s) in units of hours, every day or on any day of the week
– configure one instance for one AZ
• OpsWorks, load-based (Deploy: chef recipe)
– can specify CPU/MEM/workload avg. based on an OPS layer
– UP: when to increase instances
– Down: when to decrease instances
– No max./min. # of instances setting
– configure one instance for one AZ
• EC2 (Deploy: user-data)
– can set max./min. # of instances
– Multi-AZs support
40. Auto Recovery based on Monit
• OpsWorks already uses Monit for Auto Recovery
– Leverage the Monit on EC2
– We have practice with it on-premise
Confidential | Copyright 2014 TrendMicro Inc.
[Figure: API servers in AZ1 and AZ2 inside an Auto Scaling group.]
https://mmonit.com/monit/
• Instance check by CloudWatch
• Process check by Monit
– No process: restart process
– Process health check failed: terminate EC2
• Terminate EC2 -> Auto Scaling group launches new EC2
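The recovery rules above amount to a small decision table. The sketch below expresses them as a function. The function and action names are hypothetical, and in practice Monit would run these checks itself instead of Python.

```python
def recovery_action(process_running, health_check_ok):
    """Map the two process checks on the slide to a recovery action."""
    if not process_running:
        return "restart_process"
    if not health_check_ok:
        # Terminating the instance lets the Auto Scaling group
        # launch a replacement.
        return "terminate_instance"
    return "none"
```

Restarting is tried only for a missing process. A process that runs but fails its health check is treated as an unhealthy machine and replaced.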
41. Little variances among AWS regions
• Impact
– The same automation scripts may not run successfully across regions, and sometimes not even within the same region
• Issues
Columns: Service, Regions, Root cause
– OpsWorks, same region (us-west-2): the accepted S3 URL format for the “Repository URL” property changed from “https://s3.amazonaws.com” to “https://s3-us-west-2.amazonaws.com”
– OpsWorks, us-west-2 V.S. us-east-1: still the “Repository URL” issue, “https://s3-us-west-2.amazonaws.com” V.S. “https://s3.amazonaws.com”
– EC2, us-west-2 V.S. us-east-1: the EC2 FQDN format is different, “ip-10-104-33-152.us-west-2.compute.internal” V.S. “ip-10-103-73-248.ec2.internal”
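Differences like these can be isolated in small region-aware helpers so that automation scripts do not hard-code one region's format. The sketch below reproduces the two formats reported on the slide. Applying the us-west-2 pattern to every region other than us-east-1 is an assumption, and the function names are hypothetical.

```python
def s3_repository_endpoint(region):
    """S3 endpoint in the form the slide reports for each region."""
    if region == "us-east-1":
        return "https://s3.amazonaws.com"
    # Assumption: other regions follow the us-west-2 pattern.
    return f"https://s3-{region}.amazonaws.com"


def private_fqdn(private_ip, region):
    """Private DNS name of an EC2 instance, per the slide's examples."""
    host = "ip-" + private_ip.replace(".", "-")
    if region == "us-east-1":
        return f"{host}.ec2.internal"
    # Assumption: other regions follow the us-west-2 pattern.
    return f"{host}.{region}.compute.internal"
```

Running the same scripts against a second region early, as suggested on slide 31, is what surfaces these differences before a PROD deployment does.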
42. OpsWorks V.S. image-based deployment
• OpsWorks deployment
– What we currently use
– It takes too long to launch a service component
• E.g. it takes about 10 mins to launch a Genie node
• Image-based deployment
– Theoretically, it should take very little time to launch a service component
– More responsive for peak workloads
– AMI (Amazon Machine Images) V.S. Docker images ?
43. How about API Gateway and ECS ?
• API Gateway
– Not a good fit because it is only Internet accessible
– Cold start
– RDB connection overflow
– CORS integration for web UI
• ECS
– Still need to run standby EC2 instances for peak…
– Only takes care of RESTful API services
– Kubernetes is more suitable for our usecases