Infrastructure Management with Atlantis

Diagram: Rebecca Kilberg

At Truss, we follow a standard pattern for managing AWS infrastructure with Terraform: use remote state, keep infrastructure code under version control, submit pull requests (PRs) with proposed code changes, include the terraform plan output in the PR comment, and wait for a teammate to review and approve the PR before applying the changes. The process works well, but it can be time-consuming.

Early last year, we used Atlantis to automate the Terraform workflow on a client project. It proved so beneficial that we started using it to manage our own AWS environment and introduced it on another project that needed to limit access to its production infrastructure. Those experiences convinced us that Atlantis can be useful to any project, regardless of team size or access requirements. We wanted to share why we came to that conclusion and some of the patterns we’ve developed.

But first...

What is Atlantis?

See the Atlantis site for a great visual of the Atlantis Workflow.

At a high level, Atlantis is an application that responds to Terraform pull requests via webhooks and automates the terraform init / plan / apply workflow. Instead of running terraform commands locally, engineers check their code into version control and open a PR. From there, Atlantis generates a terraform plan and posts it to the PR. Once a teammate has approved the PR, an engineer types “atlantis apply” in a PR comment. Atlantis then applies the changes and posts a record of the changes back to the PR.

It sounds straightforward, but it’s incredibly useful.

Benefits

The following are the key benefits we’ve enjoyed while working with Atlantis:

  • Increased Productivity. Our standard projects include multiple AWS accounts (see terraform-layout-example). A single PR can involve running terraform plan in each directory, copying and pasting the plan output into the PR comment, and then applying locally before merging the PR. Automating that process with Atlantis immediately improves productivity.

  • Updated Plans. Terraform files are often modified during code review; maybe an engineer updates the original plan, maybe not. With Atlantis, any change to the terraform files will generate an updated plan, so reviewers know exactly which changes will be applied.

  • Audit Trail. Sometimes our clients want to keep track of changes made. With Atlantis and GitHub, we have a built-in log of who made which change, when, and why, not to mention who approved that change.

  • Secure Workflow. Atlantis can be configured to require that a PR be approved and mergeable before the changes can be applied. Our project teams follow this pattern even when not using Atlantis, but Atlantis adds an extra layer of protection that prevents engineers from accidentally applying changes before they’re approved.

  • Access Control. By having Atlantis apply infrastructure changes, there is less need for engineers to have broad access to the cloud environment. There are still times when engineers need to apply terraform changes, such as when bootstrapping an account or deploying Atlantis, but apart from that, they’re able to manage most routine updates by interacting with Atlantis via PR comments.

  • Happy Engineers. Last but not least, Atlantis removes toil from an engineer’s daily workflow. Less toil means happier, more productive engineers, which translates directly to better project outcomes.

Limitations

Atlantis excels at handling the routine terraform init / plan / apply workflow, but it is not designed to handle more complex tasks, like importing terraform resources or running terraform state commands. Those operations still require manual intervention. We mention this not to discourage anyone from using Atlantis, but to set realistic expectations about its capabilities.

Patterns

We’ve found the following patterns to be effective when working with Atlantis. They may not be applicable to all projects, and some teams may choose alternative approaches, but we’ll briefly explain why we think these patterns are beneficial.

  • Deploy Atlantis as a Terraform module. We use the terraform-aws-atlantis module to run Atlantis on AWS Fargate, which follows our general pattern of using ECS/Fargate to deploy web applications.

  • Lock Down the Atlantis Endpoint. Atlantis runs as a web application with an endpoint to receive webhook notifications and to view locks created by each plan. By default, the endpoint is publicly accessible. We recommend restricting ingress so that only your version control system (VCS) and engineers can access it. There are multiple ways to achieve this, such as limiting access based on VCS provider IPs or internal network IP ranges. We developed a solution to enable authentication at the Application Load Balancer level using Amazon Cognito and SAML, and contributed back to the module to add this functionality in v2.17.0.

  • Pin the Image Version. Unless you set the atlantis_image variable, the Atlantis terraform module will use the atlantis:latest image from Docker Hub. Because the tag itself never changes, a newly released "latest" image is not recognized as an update, and the running version can drift from what you expect. Therefore, we recommend passing a numbered version to the atlantis_image variable.

  • Custom Atlantis Image. In some cases, we need to build a custom image on top of the Atlantis base image rather than use the default Docker Hub image. This provides more configuration flexibility and integrates with CI tools for running image vulnerability scans.

  • One Atlantis Server per Infrastructure Repository. Historically, we’ve maintained the infrastructure code for multi-account projects in one repository. Atlantis can manage multiple accounts using cross-account roles, but some projects prefer to have different Atlantis servers manage different sets of accounts. For example, one server might manage the production account and another the non-production accounts, or one server might manage commercial accounts and another the GovCloud accounts. Atlantis does not currently provide a great solution for having multiple Atlantis servers each manage different accounts within a single repository, so to handle the multi-Atlantis scenario, we’ve chosen to adopt a multi-repo approach: each set of accounts managed by an Atlantis server gets its own code repository, which also holds the code for that Atlantis server. In our experience, this approach simplifies the Atlantis deployment, with only the minor additional overhead of managing multiple code repositories.

  • Allow Admins to Assume the Atlantis Role. When using an Atlantis server to manage multiple AWS accounts, you’ll need to configure cross-account IAM roles for Atlantis and include the “role_arn” in the provider and S3 backend blocks as described in the Atlantis documentation. Engineers without permission to assume the role will receive an error when running terraform commands locally. If your intent is to limit engineer access, this may be the desired outcome. However, to support those situations where an admin still needs to run a terraform command locally, we configure the Atlantis IAM role to be assumable by the admin IAM users.

  • Enumerate Directories Managed by Atlantis. We don’t use Atlantis to manage every account or directory on a project. For instance, we don’t use it to manage the directory where its own infrastructure is deployed, nor do we use it to manage an account’s bootstrap directory, which uses local state. Consequently, we enumerate each directory managed by Atlantis in the repo-level atlantis.yaml config.

  • Separate Code Repositories for Internal Modules. See our terraform-layout-example repository for a further discussion of this topic.

  • Separate User for Each Atlantis Instance. We create a separate GitHub user for each Atlantis server, only grant it access to the repositories it needs to manage, and frequently rotate its credentials. This reduces the potential risk if an Atlantis user’s credentials are compromised.
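To make the cross-account role pattern above more concrete, here is a minimal sketch of what the “role_arn” settings and the role's trust policy might look like. Every account ID, role name, user name, bucket, key, and region below is a hypothetical placeholder, not a value from our projects. The sketch also assumes a Terraform version in which the S3 backend accepts a top-level role_arn; recent Terraform releases move this setting into an assume_role block inside the backend, so check the documentation for the version you run.

```hcl
terraform {
  backend "s3" {
    bucket = "example-terraform-state"       # hypothetical bucket name
    key    = "example-app/terraform.tfstate" # hypothetical state key
    region = "us-west-2"                     # hypothetical region

    # Backend blocks cannot reference variables or locals,
    # so the role ARN is written out literally here.
    role_arn = "arn:aws:iam::111111111111:role/atlantis" # hypothetical
  }
}

provider "aws" {
  region = "us-west-2" # hypothetical region

  # Atlantis (and any admin allowed to assume the role) runs
  # terraform against this account through the same role.
  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/atlantis" # hypothetical
  }
}

# Trust policy for the Atlantis role, defined wherever that role is
# managed. It lets both the Atlantis server's task role and a named
# admin user assume the role.
data "aws_iam_policy_document" "atlantis_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type = "AWS"
      identifiers = [
        "arn:aws:iam::222222222222:role/atlantis-ecs-task", # hypothetical task role
        "arn:aws:iam::222222222222:user/example-admin",     # hypothetical admin user
      ]
    }
  }
}
```

Keeping the admin principals in the same trust policy as the Atlantis task role means an admin who runs terraform locally goes through exactly the same role, and therefore the same permissions, that Atlantis uses.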

Implementation

You may be wondering how much of a lift it is to deploy Atlantis or how difficult it is to integrate with existing infrastructure. Provided you’re already using remote state to manage your Terraform configuration, introducing Atlantis onto a project can be done by deploying the terraform-aws-atlantis module and configuring the IAM roles Atlantis will use to manage the infrastructure. Using Atlantis does not require migrating state and does not lead to vendor lock-in. If you decide later that you no longer need or want to use Atlantis, you can decommission the Atlantis server and revert to how you were managing the Terraform code previously.
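As a rough illustration of that deployment step, a module call might look something like the sketch below. It assumes a 2.x release of the terraform-aws-atlantis module, where atlantis_image is a top-level input; the image tag is a placeholder rather than a real version, and the other inputs a working deployment needs are deliberately left out.

```hcl
module "atlantis" {
  source = "terraform-aws-modules/atlantis/aws"

  name = "atlantis"

  # Pin a numbered image rather than relying on the module's default.
  # "vX.Y.Z" is a placeholder; substitute the release you have tested.
  atlantis_image = "runatlantis/atlantis:vX.Y.Z"

  # Networking, DNS, and VCS credential inputs are omitted from this
  # sketch; see the module's documentation for the full list.
}
```

Upgrading Atlantis then becomes an ordinary reviewed change: bump the tag in a PR, review the plan, and apply.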

If you’re new to Atlantis, you likely still have several questions about how to deploy and configure it. We recommend visiting the Atlantis site and reviewing their documentation for additional information. We also encourage you to read and thoroughly understand the security implications described there.

Hopefully this post has provided some valuable tips and information about Atlantis. As we continue to introduce it on more projects, we may provide additional updates or blog posts in the future. You can also visit our Engineering Playbook, where we discuss some of the implementation details and gotchas we’ve encountered.