Building for the Future with Terraform


Here at Truss we use Terraform extensively for deploying infrastructure, especially in AWS. Over the years, we’ve found a lot of traps, both of our own creation and in clients’ infrastructure, and we’ve developed patterns to avoid running into trouble down the road. In this blog post, I want to talk about the traps we’ve seen and the patterns we use to avoid problems as we build out infrastructure.

Avoid Using Terraform Workspaces by Default

Terraform provides a feature called Workspaces, which allows you to define the infrastructure for a root module once and then use the terraform workspace command to switch between multiple instances of that configuration, each with its own state and, by convention, its own .tfvars file to populate values. Note that this is different from a workspace in Terraform Cloud/Enterprise, so the following advice does not apply to those.

On the surface, this would seem to be an elegant solution for deploying multiple environments and making sure they are exactly the same. After all, if you deploy your dev, stage, and prod environments with exactly the same code, just different variables, you can be confident they are as similar as possible.

Unfortunately, this is also the problem with workspaces. If you force your development and production environments to use the same exact code, how do you test new versions of your infrastructure? Let’s say you want to change your application to be backed by a different type of datastore, or add in some new AWS resource that was just released. How do you do that in your development environment without doing it in production at the same time? While there are ways to do it with ternary operators, dynamic resources, or other tricks, those can introduce unwanted complexity or cause resources to be renamed, setting you up for potentially wiping out production resources if you’re not careful.
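For illustration, here is a hedged sketch of the kind of conditional that tends to creep in when one workspace-driven codebase has to serve every environment; the resource and names here are hypothetical:

resource "aws_dynamodb_table" "new_datastore" {
  # Only create the new datastore in the dev workspace for now.
  count        = terraform.workspace == "dev" ? 1 : 0
  name         = "my-app-datastore-${terraform.workspace}"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "id"

  attribute {
    name = "id"
    type = "S"
  }
}

Beyond the extra complexity, note that when the conditional is eventually removed, the resource’s address changes from aws_dynamodb_table.new_datastore[0] to aws_dynamodb_table.new_datastore -- exactly the kind of rename that requires a careful terraform state mv to avoid destroying and recreating the resource.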

A common solution we’ve heard to this problem is to simply not apply changes to production until you’re ready to move it to the new version of the infrastructure, but this means leaving your infrastructure as code out of sync with the desired configuration in reality for hours, days, or even longer -- which violates the entire principle of infrastructure as code. If you need to run Terraform before you’re ready, like making a change to mitigate an ongoing incident, you’re potentially setting yourself up for disaster.

In addition, there’s also the simpler problem of not realizing which workspace you’re in and running a terraform apply that does something unexpected. Avoiding workspaces doesn’t eliminate that possibility, but in our experience it certainly seems to make it less likely.

Instead, we use a standard layout for our Terraform repositories. This keeps our different environments and stacks logically separate, making it possible for us to ensure that the code describes the actual AWS resources. To keep environments in sync, we make extensive use of Terraform modules to describe components of our stacks.
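As a rough sketch (directory names here are illustrative, borrowing the namespaces described later in this post), that layout is a repo where each directory is its own root module with its own statefile:

infra/
  bootstrap/       # S3 bucket and DynamoDB lock table for Terraform's own state
  admin-global/    # account-wide resources: IAM users, AWS log buckets
  app-dev/         # dev environment for the application stack
  app-staging/     # staging environment
  app-prod/        # prod environment, pinned to released module versions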

Speaking of modules…

Use Separate Repos and Versioning for Terraform Modules

We use modules extensively at Truss. You can find many of our general-purpose ones on the Terraform Registry, but we also use a variety of internal modules to deploy our applications. For a while now, we’ve put these in separate repositories with names that follow the registry conventions, e.g. terraform-aws-my-app-backend, rather than putting them in a modules directory inside our main infrastructure repository. There are a number of reasons for doing this.

First, and most importantly, it means that we can version our modules. We can have the dev environment use v2 of the my-app-backend module and production use v1 for as long as we need to test v2, then make a deliberate change to move production to v2 when we’re ready. If your modules live in your central Terraform repo, any change to a module affects every instance of that module in your infrastructure all at once -- leaving you in a similar position to the one workspaces put you in, as described above. By using separate repos and versioned module calls, we can ensure that our Terraform repo is always canonical for every part of our infrastructure.

You can accomplish this versioning in a number of ways; the easiest is with the ref component of the source parameter in the module call, which allows you to specify a tag or git SHA for the module like so:

module "my_app_backend_dev" {
  source = "git@github.com:trussworks/terraform-aws-my-app-backend.git?ref=v1"
}
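Continuing the hypothetical module above, the dev and prod root modules can then pin different versions of the same module:

# In the dev root module: testing the next major version.
module "my_app_backend" {
  source = "git@github.com:trussworks/terraform-aws-my-app-backend.git?ref=v2"
}

# In the prod root module: still pinned to the known-good release.
module "my_app_backend" {
  source = "git@github.com:trussworks/terraform-aws-my-app-backend.git?ref=v1"
}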

This is also the reason we recommend a separate repo for each of your modules, rather than a single repo containing all of them. If each module has its own repo, you know there is a one-to-one mapping between git SHAs and versions of that module. With multiple modules in the same repo, you may end up with multiple git SHAs that point to the same version of a module’s code (because someone changed another module somewhere), which can be extremely confusing.

You can also do this with a private Terraform registry; at Truss, we haven’t done so simply because it would be overkill for most of our projects, and the benefits over the method above are not particularly great. If you’re in an environment with a large number of developers, it might be worth your while.
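For reference, a registry-based call looks something like the sketch below, with a version argument replacing the git ref; the hostname and organization here are hypothetical:

module "my_app_backend" {
  source  = "app.terraform.io/my-org/my-app-backend/aws"
  version = "~> 1.2"
}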

Separating the repos out also provides a number of other benefits. If you want to open source your modules in the future, this makes it easy; Truss does this extensively with our own modules. You can also more easily segregate access to different modules, which can be important if, say, you only want your database engineers to be able to change your database Terraform module.

There is one case where moving modules out of your main Terraform repo isn’t a good idea, though: if you have many, many calls (think dozens or hundreds) to the same module in the same root module, having that module in a separate repo can make running Terraform in that root module excruciatingly slow, since terraform init fetches a separate copy of the module for every call. In that case, it may be a good idea to keep that particular module inside the repo and call it with a relative path. Thanks to my former coworker Sarguru for highlighting this edge case on Twitter.
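In that situation, a relative-path call is a reasonable escape hatch (the module and input names here are hypothetical):

module "queue_worker_42" {
  # One of many calls to the same in-repo module; a relative path means
  # terraform init does not have to fetch the module once per call.
  source = "../../modules/queue-worker"

  name = "worker-42"   # hypothetical input variable
}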

Make Terraform Modules Self-Sufficient

Terraform modules are a great way to avoid repeating yourself, but every level of abstraction has the potential to obscure what is actually going on and to complicate your upgrade path. If you overuse Terraform modules, you can get to the point where every upgrade means unwinding a matryoshka doll of Terraform modules to propagate your changes.

To solve this problem, we try to make sure that our internal modules do not have multiple levels of dependencies -- an internal module should generally only include external, generic modules (usually from the public registry) and raw Terraform resources. We want to avoid having an internal module that calls another internal module (which inevitably ends up calling yet another internal module).

This also means that the scope of our modules is an important consideration; instead of a single module that deploys an entire stack, for instance, we may have multiple modules (say a frontend, backend, and database module trio) that together deploy a stack. Those can then be tied together by keeping all their calls in a single Terraform file in the root module. Ideally, each stack will have its own root module, so even if we don’t encapsulate it in a module, it is still clear where the boundaries of the stack are.
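As a sketch (the module sources, inputs, and outputs here are hypothetical), the root module for a stack might tie the trio together in one file like this:

module "database" {
  source = "git@github.com:trussworks/terraform-aws-my-app-database.git?ref=v4"
}

module "backend" {
  source = "git@github.com:trussworks/terraform-aws-my-app-backend.git?ref=v2"

  database_endpoint = module.database.endpoint   # hypothetical module output
}

module "frontend" {
  source = "git@github.com:trussworks/terraform-aws-my-app-frontend.git?ref=v3"

  backend_url = module.backend.url               # hypothetical module output
}

Because each of these modules only wraps raw resources and generic public modules, there is no deeper nesting to unwind when something needs to change.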

Keep Root Modules Tightly Constrained

Terraform root modules (also sometimes called namespaces) are slices of Terraform code that each correspond to a single statefile; in code, that means a single directory with a single backend definition. In our experience, it’s important to keep individual root modules tightly constrained; if they grow too large, you end up with issues like:

  • Running a Terraform plan or apply against the root module can take an extremely long time, or on the extreme end, run into API rate limits.

  • Multiple engineers trying to make changes to the same root module run into conflicting locks, making it harder to test and deploy changes.

  • The potential blast radius for a corrupted statefile or a bad change that slips through the cracks can affect vast swaths of your infrastructure -- Charity Majors wrote a blog post on this topic in 2016. While HashiCorp has improved the tooling for fixing these issues since Charity’s post, it is still a potential headache to keep in mind.
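For reference, the single backend definition mentioned above is what ties a root module to its statefile; a minimal sketch, with hypothetical bucket, key, and table names:

terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"
    key            = "app-dev/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}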

What’s the right size for a root module? That’s a difficult question to answer. In general, though, here are a few guidelines:

  • Keep things that are not related separate. For instance, in our example Terraform layout, note that every account has an “admin-global” namespace that controls our account-wide resources, like IAM users or AWS log buckets. These are logically separate from application infrastructure, so we don’t want them sharing a root module with that infrastructure.

  • Anything you want to isolate for security or safety purposes should be a separate namespace. You’ll notice that our “bootstrap” namespace, which holds the S3 bucket and DynamoDB lock table for Terraform itself in every account, is a separate root module from everything else. Breaking these resources potentially means bad things for the whole account, so we want to make it harder for people to accidentally break something with them.

  • Environments or isolated application components should not share a root module. Don’t put your dev environment and your prod environment in the same root module, and if you have two applications that don’t share any components, don’t have them share a root module either.

If you get to the point where you need to split up root modules because they are getting too large, try not to do it along arbitrary lines. People need to be able to know where to find the resources they want to change, and if they are presented with “namespace-a”, “namespace-b”, and “namespace-c” with no easily discernible difference between them, that is going to be confusing.

Use -target Sparingly

One of the strongest “infrastructure smells” for us is when projects start using Terraform’s -target option as a matter of course. If you find yourself using -target as part of your regular workflow, it probably means that you’ve got other problems that you need to address.

Using -target from time to time, like using the terraform state commands, is almost unavoidable. However, when people start reaching for it every day, that usually means one of two things: either your namespaces are too large, as described above, or your engineers do not trust that they can safely run a blanket terraform apply. The latter issue is actually much worse than the former; left unaddressed, the problem will compound itself, exacerbating drift and undermining your engineers’ trust in the validity of your Terraform -- which means that your infrastructure as code isn’t doing its job!

If you find yourself using -target regularly, consider taking a step back and asking yourself why this is the case, and think about whether there are more systemic issues you need to address.

Minimize Terraform Drift and Make Changes Fast and Easy

I’ve touched on this topic throughout this post, but it’s worth calling out specifically, because drift is one of the most corrosive forces on your infrastructure. The drift itself is bad enough -- it lays out a minefield for you or your coworkers to run into -- but what’s worse is that it undermines your team’s confidence in your Terraform code and makes people less likely to trust it, especially in high-pressure situations like the middle of an incident. Unfortunately, those situations are often the times when being able to trust your infrastructure as code matters most.

One of the reasons drift arises is that making infrastructure changes “the right way” is an onerous task. Consider a case I’ve seen a number of times before: a team keeps their application code and their Terraform code in the same git repository and uses the same process to merge Terraform changes as application changes -- down to running the full suite of application tests against every Terraform change, which might take up to an hour. That process is going to be hard to follow when time is tight -- so people make changes via the AWS CLI or console, intending to backport them to Terraform. Only they don’t always do that, so when people run Terraform they see a bunch of changes they didn’t anticipate and are unsure whether they can safely proceed.

As a result, when we see these kinds of patterns, our advice is usually some combination of the following:

  • Separate infrastructure code from application code. Yes, you can configure your tests to run more selectively, but that’s usually more difficult than just breaking the repo apart. Breaking the two apart also makes it much harder to create dependencies between your Terraform code and your application code; in our experience, tightly coupling the two ends up causing a great deal of pain in the long run. In addition, it makes it easier to segregate permissions to read or commit code to the two repositories, a common requirement in tightly regulated environments.

  • Use something like Atlantis to automate Terraform changes and smooth out the approval/apply process. Atlantis allows you to handle changes to multiple root modules easily, prevents conflicting PRs, enforces the approval process, and keeps your infrastructure consistent. Drift is much harder to introduce when your default method of applying changes runs straight from your code. (Terraform Cloud/Enterprise provide a similar workflow, but we haven’t used it extensively at Truss.)

  • Pair a tool like Renovate or Dependabot with Atlantis to make sure you are using up-to-date versions of Terraform providers and public Terraform modules. These tools automatically open a PR for each new version, Atlantis generates a plan for it, and you can quickly see whether picking up the new version introduces any breaking changes; if not, the upgrade can be applied right away.

The upshot of these patterns is to encourage, as much as possible, a culture where the Terraform code is canonical -- that is, where anything not in the Terraform code should be expected not to persist, and where making changes to infrastructure via Terraform is the default method.

On the extreme end of this, I’ve seen some people suggest running a regular job that applies all their Terraform code every day to enforce this. I’m not sure I’d go that far (the potential for very bad things to happen is high), but getting a report of drift every day might be a valuable exercise. I’d caution, however, that if your drift is significant, it’s best to cut it down before turning on any kind of notifications about it; otherwise, you’re likely to numb people to the idea of having lots of drift before you’ve even started tackling the problem.

Avoid Circular Dependencies

This is something that tends to come up as your Terraform infrastructure gets more complicated, but it can show up much earlier too. As your infrastructure grows incrementally, Terraform makes it easy to create circular dependencies that will not bite you until you try to deploy it all from scratch again.

The continually evolving nature of modern infrastructure means we may create a module A to start building our service, and then a month later we create a module B that consumes an output from module A (such as a security group to allow communication). Six months after that, we add some new feature for module A that consumes an output or resource that module B creates (maybe an SQS queue).

Now we have a situation where module A depends on module B, which depends on module A. This isn’t a problem in steady state -- but if we try to stand this application up in another region, or if there’s a catastrophic failure and we have to rebuild this infrastructure from scratch, we’re going to run into problems. Unfortunately, these dependencies are also difficult to see as they slowly accumulate over time, making it easy to get surprised by them at the worst possible moment.
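One common way the cycle sneaks in without Terraform ever noticing is through data-source lookups rather than explicit module references; a hypothetical sketch of the situation above might look like this:

# Inside module B, added a month in: look up the security group that module A creates.
data "aws_security_group" "app_a_backend" {
  name = "app-a-backend"     # hypothetical name of a group owned by module A
}

# Inside module A, added six months later: look up the queue that module B creates.
data "aws_sqs_queue" "app_b_events" {
  name = "app-b-events"      # hypothetical name of a queue owned by module B
}

Neither module references the other directly, so plans succeed happily in steady state -- but in a fresh account, neither data source can resolve until the other module has already been applied.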

We can try to avoid the problem by adopting informal conventions -- the module for the datastore should never consume something from the backend module, which should never consume something from the frontend module -- but as they say, these end up being more like guidelines than hard-and-fast rules, so circular dependencies tend to creep in anyway.

The best way to push back against this issue is to regularly exercise your code by standing up a new environment from scratch; while this might seem like an onerous task, doing these exercises repeatedly will not only highlight your circular dependencies but also help you update documentation and spot where the process could be simplified or automated. You may even be able to get to the point where the whole process can be handled by your CI/CD tools.

Wrapping Up

Infrastructure as code is key to building secure and reliable infrastructure, but just like any code, its utility depends on how well it is built and maintained. By using the techniques above, you make it easy for your engineers to use Terraform for as many infrastructure changes as possible, reducing the temptation to make out-of-band changes (and the drift they create) while encouraging the constant refinement of your code.