A talk by Matthew Revell from OpenCredo. Discussing some of the problems with designing infrastructure deployments with Terraform.
Sometimes the most obvious and logical approach does not yield the desired outcome. Often the nature and culture of an organisation needs to be considered when structuring infrastructure deployments.
10. Consider whether a
single resource
module adds any
value
Consider whether
the additional
complexity is worth
the perceived value
Consider whether
the module will be
usable by the
intended
consumer(s)
Value Complexity Usability
Lessons Learned (small modules)
11. pray the code works the second time
Run Once And Forget
16. Testing modules in
isolation can only
validate the
internals
Full deployment
tests are essential
to validate the
entire Terraform
structure
A dedicated
account can allow
continuous testing
without disruption
Testing Testing Testing
Lessons Learned (running code once)
17. or in this case Terraform States
How To Slice The Cake
20. Resources should
be grouped such
that states do not
grow exponentially
States should have
a limited scope to
minimise impact in
the event of
mistakes
Teams should be
able to manage
their own Terraform
independently
Scalability Blast Radius Ownership
Lessons Learned (dividing states)
27. ● Thin wrapper for Terraform
● Allows for easier management of backends
● Reduces amount of repeated code
● Developed by Gruntworks
Terragrunt
Image courtesy of Gruntworks Inc.
32. ● Generic templating engine
● Generate output in YAML or JSON
● Uses jsonnet, jinja2 (and more)
● Developed by Deepmind
Kapitan
Image courtesy of Deepmind (Alphabet Inc.)
33. Repeated code and
copy and pasting
will definitely lead
to mistakes
A lack of clarity and
readability will also
lead to confusion
and mistakes
Tooling can help
maintain clean code
in complex
deployments
Keep it DRY Clarity Tooling
Lessons Learned (repo structures)
PERSONAL PAST EXPERIENCES
CURRENTLY work for a company called OPENCREDO as a SENIOR CONSULTANT
OC is a SOFTWARE DEVELOPMENT CONSULTANCY
We SPECIALISE in …
CLOUD NATIVE ARCHITECTURE
DATA ENGINEERING
HELP OUR CLIENTS to ADAPT and to ADOPT new TECHNOLOGIES and PRACTICES
I myself STARTED MY CAREER as a SYSTEMS ENGINEER
It was my job to MANAGE SERVERS
AS TIME AND TECHNOLOGY MOVED ON ...
I deal a lot more with DEVOPS TOOLS and MANAGING CLOUD PLATFORMS
Rather than MANAGING SERVERS
When I FIRST STARTED managing CLOUD INFRASTRUCTURE
Things were relatively PRIMITIVE
MOSTLY still done using BASH and PYTHON… (same as before)
SOME PLATFORM SPECIFIC TOOLS SUCH AS CLOUDFORMATION
DIFFICULT TO WORK WITH
Then around 2015, I started using TERRAFORM
It was a GAME CHANGER
This was AROUND THE TIME Terraform 0.6 was released
Given that the LATEST RELEASE IS 0.12, I guess it was about HALF WAY THERE
This was an INTERESTING TIME for Terraform
No WAY TO DEFINE STATE BACKEND IN CODE
No IMPORT command
No STATE command
No NATIVE STATEFILE LOCKING
Many NOT-SO-FUNNY incidents…
When SOMEONE DESTROYED PRODUCTION because they’d configured the wrong backend
When a REFACTORING PROCESS was a MARATHON of MANUALLY EDITING state files
THANKFULLY those days are BEHIND US
If you’ve just started using Terraform, you’ll never know the horror!
I’m aware that makes me sound like someone’s grandad going on about “how things were in my day”, but it happens to be true
HOWEVER there are MANY problems I’VE EXPERIENCED which I THINK STILL APPLY
When I FIRST USED TERRAFORM
LARGE COMPANY
Went through the ARCHETYPAL JOURNEY
AS SEEN ON YOUTUBE / AT HASHICONF
STARTED with a SINGLE FILE and STATE
STARTED BREAKING DOWN
Into MULTIPLE STATES and REUSABLE MODULES
Modules DEVELOPED as a CENTRAL TEAM
Modules INTENDED to be CONSUMED by INTERNAL TEAMS
Approach GENERALLY WORKED REALLY WELL
MODULES FOR
VPC
ELK STACK
COMMON APP PATTERNS
THEN MODULES STARTED GETTING SMALLER
Started making MODULES for SINGLE RESOURCES
EC2 INSTANCE
LOAD BALANCER
AUTOSCALE GROUP
Done for a SPECIFIC GOAL; TAGGING
SEEMED like a NEAT SOLUTION
PROVIDE PIECEMEAL BUILDING BLOCKS
Problem:
DIDN”T WORK AS PLANNED, WHEN TEAMS TRIED TO ADOPT THINGS GOT COMPLICATED
Deployments had SO MANY MODULES
TERRAFORM INIT TAKING AGES
Modules NESTED several LAYERS DEEP
Modules needed to REFERENCE EACH OTHER
NESTING made this very DIFFICULT
REMINISCENT of NORMAL FORMS in DATABASES
THERE IS A POINT where the OVERHEAD and COMPLEXITY is no longer worth it
Fix:
Company also had AUDITING SOFTWARE
CHECKED each ACCOUNT and sent REPORTS (NAGGING EMAILS) to TEAM LEADS / HEADS OF TECH
No REAL NEED to enforce through MODULES
VALUE
SMALL modules CAN BE USEFUL, but NEED TO ADD VALUE
E.G. SECURITY GROUP RULE modules
COMPLEXITY
IS IT WORTH THE PERCEIVED VALUE
USABILITY
INTENDED CONSUMER
Setup:
If we only run code once, can we be sure it will work again?
Once I worked at a company, where we BUILT a SERVICE
Setup (cont):
Done RIGHT with TERRAFORM
And THEN we built A SECOND SERVICE
Setup (cont):
And we BUILT it RIGHT TOO, with TERRAFORM
And then we DECIDED they needed to TALK TO EACH OTHER
Setup (cont):
We UPDATED the TERRAFORM and EVERYTHING worked FINE
CLEAN TERRAFORM PLAN
Problem:
One day, we DISCOVERED that the ENVIRONMENT had been COMPROMISED
COMPANY POLICY was that the ENVIRONMENT should be DECOMMISSIONED and a NEW ENVIRONMENT CREATED
It was at about THIS TIME that we DISCOVERED that our TERRAFORM DIDN’T WORK FROM SCRATCH
EACH SERVICE was in its OWN STATE, and REFERENCED SG GROUP ID from the others REMOTE STATE
FINE when BOTH ALREADY EXISTED
BOTH FAILED when NEITHER EXISTED
A -> B -> A
ULTIMATELY we needed MORE TESTING
TERRATEST was NOT ENOUGH
Needed to TEST FULL DEPLOYMENT
DEDICATED ACCOUNT FOR IAC TESTING
Setup:
A COMMON EXERCISE; SEGMENT MODULES / STATES
We DECIDED to SPLIT STATES based on TYPE OF RESOURCE
So we had a STATE for
NETWORKING resources
SECURITY resources
Etc.
It looked A LITTLE BIT LIKE THIS
In reality a bit MORE COMPLICATED
TO START WITH this was absolutely FINE
EVERYTHING was in a logical PLACE, we knew WHERE to look to UPDATE our DEPLOYMENT
Where THINGS started to BREAK DOWN was when THE COMPANY STARTED GROWING
We were CREATING new TEAMS and SERVICES almost CONSTANTLY
Every TEAM needed IAM ROLES, S3 BUCKET for ARTIFACTS, amongst other things
Every SERVICE needed DNS RECORDS, a DATABASE, S3 BUCKET etc. etc.
Problem:
UPDATE 5-6 Terraform STATES
STATES GROWING in size with each ADDITION, making UPDATES SLOWER
BLAST RADIUS
Our DATABASE STATE contained ALL the DATABASES, so a MISTAKE could AFFECT MANY SERVICES
FEW PEOPLE responsible for ALL THE TERRAFORM
Fix:
We NEEDED a bit of a RETHINK… and we REALISED
THINGS we built once per
ENVIRONMENT, TEAM, SERVICE
So we REFACTORED our STATES and MODULES to look a bit more like this...
Fix (cont):
This WORKED VERY WELL
CREATED NEW ENVIRONMENT with several STATES, but once we’d provisioned, those were mostly left alone
If a NEW TEAM was being created we had a TEMPLATE and CREATED A NEW STATE for each, which again remained mostly static once provisioned
The same for a NEW SERVICE, but even better, once a TEAM was set up, they could MANAGE THEIR OWN TERRAFORM
This also meant that MISTAKES would only AFFECT ONE specific SERVICE
SCALABILITY
States should not grow exponentially
BLAST RADIUS
States should have limited scope to minimise impact of mistakes
OWNERSHIP
Teams should be able to manage Terraform independently
Setup:
This is a problem that hasn’t happened JUST ONCE, it come up pretty much EVERY TIME
And it’s one I don’t think has REALLY BEEN SOLVED YET
How SHOULD we STRUCTURE our DEPLOYMENT REPOS
Setup (cont):
Here’s a typical REPO LAYOUT, where the code for DEV and PROD is pretty much the same
SIMPLIFIED but main point is that DEV and PROD contain the same files
The main.tf may LOOK LIKE THIS...
ONLY difference between DEV and PROD -> VERSION NUMBERS
And the TERRAFORM.TFVARS SIMPLY HAS...
ASSIGNING the VARIABLE
And the BACKEND.TF...
ONLY DIFFERENCE is the KEY
IF THIS SIMPLE…. OK
IS AT LEAST CLEAR, but not DRY
Problem:
The MAIN ISSUE with this is that UPDATES WENT INTO DEV, and then copied over to PROD
OVER THE YEARS I’ve seen MANY ATTEMPTS to RESOLVE…
GIT BRANCH
SYMLINKING
WORKSPACES; HARD TO SEE HOW MANY-> WHAT IS DEPLOYED IN EACH -> NOT INTENDED FOR ISOLATED ENVS
Fix:
ONE SOLUTION THAT WORKED QUITE WELL WAS A TOOL CALLED TERRAGRUNT
TERRAGRUNT
Thin WRAPPER for TERRAFORM
EASIER management of TERRAFORM BACKENDS
REDUCES amount of REPEATED CODE
DEVELOPED BY GRUNTWORKS INC.
So with TERRAGRUNT the repo would look like this
Significantly FEWER FILES
NOTE, the common.tfvars
DEFINE BACKEND ONCE DYNAMICALLY
TG HAS BUILT IN FUNCS TO FILL BACKEND
SET DEFAULT FOR VARIABLE
AND THEN IN EACH terraform.tfvars ...
IMPORT common.tfvars
DEFINE SOURCE
OVERRIDE VARIABLES FOR SPECIFIC ENV
The ONLY ISSUE we experienced was that TERRAGRUNT EXPECTED ONE MODULE PER STATE
THIS WAS NOT HOW WE WANTED TO WORK
We WORKED AROUND that with some META MODULES
WASN’T IDEAL but it solved most of the issues
TERRAGRUNT UPDATED NOW SUPPORTS MULTI MODULES
IT ALSO HAS A NICE ‘apply-all’ FEATURE
ONE MORE TOOL HONOURABLE MENTION...
KAPITAN
GENERIC TEMPLATING TO AUTO-GENERATE CODE
OUTPUT IN YAML OR JSON
USES JSONNET AND JINJA
Developed by DEEPMIND
USEFUL IF USE
KUBERNETES +
TERRAFORM +
ANSIBLE +
PACKER
TRADE OFF: COMPLEXITY
KEEP IT DRY
REPEATED CODE WILL LEAD TO ERRORS
CLARITY
ALSO IMPORTANT; LACK CAUSES CONFUSION
TOOLING
LEVERAGE TOOLING
The SCENARIOS I’ve DISCUSSED covering
SHAPE and SIZE of MODULE
CONFIDENCE in RUNNING DEPLOYMENTS
STRUCTURING STATES
STRUCTURING DEPLOYMENT REPOS
KEY THEMES
FIRSTLY, SAME AS SOFTWARE DEVELOPMENT
TESTING + CLEAN CODE
MORE IMPORTANTLY...
PROBLEMS often arise due to ARCHITECTURAL DECISIONS; DON’T CONSIDER THE NATURE OF THE ORG + CHANGES
MOST IMPORTANT LESSON LEARNED IN MY EXPERIENCE AND MISADVENTURES WITH TERRAFORM
NO DEFINITIVE LOGICAL PRINCIPAL
The SHAPE, SIZE, and CULTURE of an ORGANISATION are IMPORTANT
CONSIDER THESE WHEN DESIGNING DEPLOYMENT STRUCTURE