2023-03-05

Terraform


Multi Environment Terraform

Deploying terraform is hard. Far harder than writing it. In an ideal world, you can version and deploy your terraform just like application code. However as you deploy your infrastructure to more and more environments, it becomes trickier and trickier to keep environments in a state of sync with each other. I like to think of terraform standardization in two dimensions.

Horizontal Standardization - Between different projects - This kind of standardization focuses on reusability. The native terraform modules provide a great mechanism for this, and the solution ultimately resembles the same kinds of library reuse seen in other languages.

Vertical Standardization - Between environments in the same project - This is a problem more unique to Infrastructure as Code (IaC) tools. With application code, the worst that happens is different environments have different application versions. With IaC, it’s quite possible each environment has radical differences. Some of these may be intentional, like resource sizing. The unintentional (often unknown) differences are the most dangerous, and can pose real operational risk.

In practice I’ve found Vertical Standardization to be the more unsolved problem with much sharper trade-offs. Let’s walk through a couple common terraform patterns that seem good to start, but ultimately have issues here.


Pattern 1 - Git Branch Strategy

This is a pattern I often see used with gitflow processes. The main idea goes:

There are quite a lot of variations on this pattern, like triggering deploys off of tags, etc. I also want to stress that this pattern works great if you only have one or two environments to worry about. If that’s you currently, awesome. Close this page, come back when you have more environments and more headaches.

Even with only two environments though, there are a few issues that start to rise up:

Now none of the above are horrible, and can often be mitigated with talented folks and good CI/CD tooling. But we start to see the cracks where this pattern may not scale as well for more than 2-3 environments.


Pattern 2 - Env Folders

This is a pattern I’ve seen to try and scale out to multiple environments. The main idea goes:

This is both a pattern hashicorp recommends, while also acknowledging the main limitation.

Directory-separated environments rely on duplicate Terraform code. This may be useful if you want to test changes in a development environment before promoting them to production. However, the directory structure runs the risk of creating drift between the environments over time.

Despite the main issues of configuration drift, this is the first pattern that can scale to multiple environments, and truly scale well. It’s easy to have hundreds of folders maintaining their own terraform configuration. The issue becomes keeping them in sync. This pattern works great when you’re not concerned about drift between environments. This pattern has tradeoffs:


Pattern 3 - Deploy Repo

This is the pattern I’ve landed on the last few years for deploying application infrastructure. It tries to take the best of both of the above patterns. I find it reduces far more complexity than it adds. It goes like:

In practice the centralized deploy repo may look like the following:

dev/
  foo-service/
    manifest.yaml
  bar-service/
    manifest.yaml
stage/
  foo-service/
    manifest.yaml
  bar-service/
    manifest.yaml
prod/
  foo-service/
    manifest.yaml
  bar-service/
    manifest.yaml

I’ve found it easiest to organize first by environment, then by application. CICD tooling often works easier on file structure than file content, so organizing like this often allows for defaults to be created per env. Each manifest file contains a short description of what is being deployed.

# dev/foo-service/manifest.yaml
repo: foo-service
version: v1.2.3
vars:
  env_name: dev
  instance_size: m5.large
  flag_new_feature: true

This manifest file becomes pretty trivial for any CI/CD script to process and perform the relevant git clone, provide terraform state configuration, and then terraform plan/apply with the provided variables.

It’s also certainly possible just to use vanilla terraform root modules here instead, where each root module has one and only one submodule. With this it’s just too tempting to add just one resource outside the module that an environment needs. And suddenly environments are out of sync, and there is infrastructure for an application existing outside of it’s source repo.

# Don't recommend this in practice
module "foo_service" {
  source = "github.com/me/foo-service?version=1.2.3"

  env_name         = dev
  instance_size    = m5.large
  flag_new_feature = true
}

One tool I do find very useful here is terragrunt. In fact, all my manifest.yaml’s are actually terragrunt.hcl files in practice. It’s nice, because terragrunt will let you just run terragrunt plan/apply as if that file was the terraform itself.

# Point to the code/version you want to deploy
terraform {
  source = "github.com/me/foo-service?version=1.2.3"
}

# Provide inputs
inputs = {
  env_name         = dev
  instance_size    = m5.large
  flag_new_feature = true
}

# Just have terragrunt generate the backend config
generate "backend" {
  path      = "backend.tf"
  if_exists = "overwrite_terragrunt"
  contents = <<EOF
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "my-lock-table"
  }
}
EOF
}
}

I want to stress terragrunt is not a requirement of this pattern. I just find it follows a similar philosophy, and makes life easy. It’s definitely possible just to drive pipelines strictly from the yaml manifests.

Let’s look at how this pattern handles some of the common concerns.

Alright, but what doesn’t this pattern do well?


Final Thoughts

The hardest part of terraform is still usually deploying it at scale to a large number of environments. The third pattern presented here has served me very well the past few years, and moves this issue slightly more towards “solved” for me.

I’ve used it at organizations with hundreds of projects for 1000s of manifest files, and it’s scaled beautifully.

What I’m starting to realize more and more is that this pattern isn’t just limited to terraform though. Declarative deployments are a standard across a huge number of IaC tools. It’s more a philosophy than a feature. And in most of these cases, the deploy repo provides the same benefits it does for terraform. I think it’s quite possible to combine multiple tooling into one unified deploy system, where terraform.manifest.yaml, serverless.manifest.yaml, argo.manifest.yaml, pulumi.manifest.yaml, etc.manifest.yaml files are all understood by the centralized system, and managed similarly.