At Slack, we use Terraform for managing our infrastructure, which runs on AWS, DigitalOcean, NS1, and GCP. Even though most of our infrastructure runs on AWS, we’ve chosen to use Terraform rather than an AWS-native service such as CloudFormation so that we can use a single tool across all of our infrastructure service providers. This keeps the infrastructure-as-code syntax and deployment mechanism consistent. In this post, we’ll look at how we deploy our infrastructure using Terraform at Slack.
Evolution of our Terraform state files
Slack started with a single AWS account; all of our services were placed in it. In the early days, our Terraform state file structure was very simple: we had a single state file per AWS region and a separate state file for global services, such as IAM and CloudFront.
├── aws-global
│   └── cloudfront
│       ├── services.tf
│       ├── terraform_state.tf
│       ├── variables.tf
│       └── versions.tf
├── us-east-1
│   ├── services.tf
│   ├── terraform.tfvars
│   ├── terraform_state.tf
│   ├── variables.tf
│   └── versions.tf
├── us-west-2
│   ├── services.tf
│   ├── terraform.tfvars
│   ├── terraform_state.tf
│   ├── variables.tf
│   └── versions.tf
├── ap-southeast-2
│   ├── services.tf
│   ├── terraform.tfvars
│   ├── terraform_state.tf
│   ├── variables.tf
│   └── versions.tf
└── digitalocean-sfo1
    ├── services.tf
    ├── terraform.tfvars
    ├── terraform_state.tf
    ├── variables.tf
    └── versions.tf
There were identical setups for both production and development environments; however, as we started to grow, having all our AWS services in a single account was no longer feasible. Tens of thousands of EC2 instances were built, and we started to run into various AWS rate limits. The AWS EC2 console was unusable in `us-east-1`, where the bulk of our workload is located. Also, due to the sheer number of teams we have, it was very difficult to manage access control in a single AWS account. That is when we decided to build dedicated AWS accounts for certain teams and services. I’ve written two other blog posts detailing this progress in Building the Next Evolution of Cloud Networks at Slack and Building the Next Evolution of Cloud Networks at Slack – A Retrospective.
We use Jenkins as our deployment mechanism: once a change is merged to a given state file, we trigger the corresponding Jenkins pipeline to deploy the change. These pipelines have two stages: one for planning and the other for applying. The plan stage helps us validate what the apply stage is going to do. We also chain these state files, so we must apply changes in the sandbox and development environments before proceeding to production environments.
Back in the day, managing the underlying infrastructure, the Terraform code base, and the state files was the responsibility of the central Ops team. As we started building more child accounts for teams at Slack, the number of Terraform state files grew significantly. Today we no longer have a centralized Ops team. The Cloud Foundations team, which I’m a part of, is responsible for managing the Terraform platform, and the Terraform states and pipelines have become the responsibility of service owners. Today we have close to 1,400 Terraform state files owned by different teams. The Cloud Foundations team manages the Terraform versions, provider versions, a set of tools for managing Terraform, and a number of modules that provide basic functionality.
Today we have a state file per region in each child account, and a separate state file for global services such as IAM and CloudFront. However, we also tend to build separate, isolated state files for larger services, so we can keep the number of resources managed by a single state file to a minimum. This speeds up deployment time, and it is safer to make changes to smaller state files since the number of impacted resources is lower.
We use Terraform’s AWS S3 backend to store all our Terraform states in a version-controlled S3 bucket under different paths. We also make use of the state-locking and consistency-checking feature backed by DynamoDB. Having S3 object versioning enabled on this bucket allows us to roll back to an older version of the state easily.
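As a rough illustration, a state file’s backend configuration along these lines might look like the sketch below; the bucket, key, and table names are placeholders, not our real ones.

terraform {
  backend "s3" {
    # Placeholder names, for illustration only.
    bucket         = "example-terraform-states"                        # version-controlled S3 bucket holding all states
    key            = "accnt-example-dev/us-east-1/terraform.tfstate"   # per-account, per-region path
    region         = "us-east-1"
    dynamodb_table = "example-terraform-locks"                         # state locking and consistency checking
    encrypt        = true
  }
}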
Where do we run Terraform?
All of our Terraform pipelines run on a set of dedicated Jenkins workers. These workers have IAM roles attached to them with sufficient access to our child accounts so that they can build the necessary resources. However, engineers at Slack also needed a place to test their Terraform changes or prototype new modules. This place has to have an environment identical to the Jenkins workers, without the same level of access to modify or create resources. Therefore we created a type of box known as “Ops” containers at Slack, where engineers can launch their own “Ops” box using a web interface.
They can choose the instance size, region, and disk capacity before launching the instance. These instances are automatically terminated if they sit idle for an extended period. During the provisioning of these containers, all our Terraform binaries, providers, wrappers, and related tools are set up so that they have an environment identical to our Jenkins workers. However, these containers only have read-only access to our AWS accounts. They therefore allow our engineers to plan any Terraform changes and validate the plan outputs, but not to apply them directly from their “Ops” containers.
How do we deal with Terraform versions?
We started off by supporting one Terraform version. We deployed the Terraform binary and plugins to our Jenkins workers and other places where we run Terraform via Chef, our configuration management system, and we sourced the binaries from an S3 bucket.
Back in 2019, we upgraded our state files from Terraform 0.11 to 0.12. This was a major version upgrade, as there were syntax changes between these two versions. At the time, we spent an entire quarter doing this upgrade. Modules had to be copied with a `-v2` suffix to support the new version of Terraform. We wrote a wrapper for the Terraform binaries which would check the `versions.tf` file in each Terraform state file and choose the correct binary. Once all our state files were upgraded, we cleaned up the 0.11 binary and the wrapper changes that had been made. Overall this was a very painful and lengthy process.
We stuck with Terraform 0.12 for almost two years before considering an upgrade to 0.13. This time around, we wanted to put tooling in place to make any future upgrades easier. However, to make things more complicated, version 4.x of the AWS provider was also released around the same time, and we were using version 3.74.1. There were many breaking changes between the 3.74.1 and 4.x versions of the AWS provider. We decided to take up the challenge and upgrade the Terraform binary and the AWS provider at the same time.
Even though newer versions of Terraform were available (0.14 and 1.x), the recommended upgrade path was to upgrade to version 0.13 first and then upwards. We wanted to implement a system to deploy multiple versions of Terraform binaries and plugins and choose the versions we needed based on the state file. Therefore we introduced a Terraform version config file that was deployed to each box.
{
  "terraform_versions": [
    "X.X.X",
    "X.X.X"
  ],
  "package_plugins": [
    {
      "name": "aws",
      "namespace": "hashicorp",
      "versions": [
        "X.X.X",
        "X.X.X"
      ],
      "registry_url": "registry.terraform.io/hashicorp/aws"
    },
    {
      "name": "azurerm",
      "namespace": "hashicorp",
      "versions": [
        "X.X.X"
      ],
      "registry_url": "registry.terraform.io/hashicorp/azurerm"
    },
We started deploying multiple versions of the Terraform binary and the providers. The Terraform wrapper was updated to read the `versions.tf` file in each state file and choose which version of the Terraform binary to use.
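For example, a state file’s `versions.tf` might pin both the Terraform version and the provider version along the lines of the sketch below; the version numbers are placeholders.

terraform {
  required_version = "0.13.7"   # the wrapper reads this to pick the matching Terraform binary

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "3.74.1"        # kept on the 3.x provider until this state file is ready for 4.x
    }
  }
}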
Since Terraform 0.13+ allows us to have multiple versions of the same provider under the plugin directory, we deployed all the versions used by our state files to the plugin directory.
└── bin
    ├── registry.terraform.io
    │   ├── hashicorp
    │   │   ├── archive
    │   │   │   └── X.X.X
    │   │   │       └── linux_amd64
    │   │   │           └── terraform-provider-archive
    │   │   ├── aws
    │   │   │   ├── X.X.X
    │   │   │   │   └── linux_amd64
    │   │   │   │       └── terraform-provider-aws
    │   │   │   └── X.X.X
    │   │   │       └── linux_amd64
    │   │   │           └── terraform-provider-aws
    │   │   ├── azuread
    │   │   │   └── X.X.X
    │   │   │       └── linux_amd64
    │   │   │           └── terraform-provider-azuread
    │   │   ├── azurerm
    │   │   │   └── X.X.X
    │   │   │       └── linux_amd64
    │   │   │           └── terraform-provider-azurerm
This allowed us to upgrade our Terraform binary version and the AWS provider versions at the same time. If there were any breaking changes with the later version of the provider, we could pin the provider back to the previous version for a given state file. Then we were able to slowly upgrade those specific state files to the latest version of the provider.
Once all the state files were upgraded to the latest versions of the providers, we removed all the pinned versions. This allowed the state files to use the latest version of a provider available. We encouraged service teams to avoid pinning to a specific version of a provider unless there was a compelling reason to do so. As new versions of providers become available, the Cloud Foundations team deploys these new versions and removes any out-of-date versions.
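After that cleanup, a state file typically leaves the provider version loose, something like the sketch below, so it simply picks up whichever provider version has most recently been deployed; the constraint shown is illustrative and may also be omitted entirely.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.0"   # loose constraint instead of an exact pin
    }
  }
}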
This time around, we built a tool to help us manage our Terraform upgrades. We can run this tool against a given state file, and it does the following:
- Checks the current version
- Checks for any unapplied changes
- Checks if any other state files have remote-state lookups against the state file (Terraform 0.12 is unable to do remote-state lookups against Terraform 0.13+ state files, and this can potentially break some state files); a sketch of such a lookup follows this list
- Checks if the state file is able to run a Terraform plan after the upgrade
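The remote-state check matters because one state file can read another’s outputs. A minimal sketch of such a lookup, with placeholder bucket and key names, looks like this:

data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-states"                               # placeholder bucket name
    key    = "accnt-example-dev/us-east-1/network/terraform.tfstate"  # another state file's path
    region = "us-east-1"
  }
}

locals {
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id         # consuming the other state's output
}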
The initial version of the tool was a simple Bash script. However, as we added more checks and logic to the script, it started to get very complex and hard to follow. Parsing Terraform syntax with Bash is not fun, and involves a lot of string matching and `grep` commands.
We eventually replaced this script with a Golang binary. Golang’s hclsyntax and gohcl libraries made it much easier to read our Terraform configuration and load objects into Go data structures. Golang’s terraform-exec library made it easier to run Terraform plans and check for errors. We also built state file analysis capabilities into this binary, such as the ability to inspect module dependency trees.
agunasekara@ops-box:service_dev (branch_name) >> terraform-upgrade-tool -show-deps
INFO[0000] welcome to the Terraform upgrade tool
INFO[0000] dependency tree for this state file as follows
terraform/accnt-cloudeng-dev/us-east-1/service_dev
└── terraform/modules/aws/whitecastle-lookup
│ ├── terraform/modules/aws/aws_partition
└── terraform/modules/slack/service
└── terraform/modules/aws/alb
│ ├── terraform/modules/aws/aws_partition
└── terraform/modules/aws/aws_partition
└── terraform/modules/aws/ami
│ ├── terraform/modules/aws/aws_partition
└── terraform/modules/aws/autoscaling
This made it easier to see which modules are impacted by a given state file upgrade.
Once we gained confidence in the tool, we added the ability to upgrade a percentage of our Terraform state files in a single run. Building the tooling and performing the 0.13 upgrade was a time-consuming project, but it was worth it: we are now in the process of upgrading our Terraform version again, and this time it has been smooth sailing.
How do we manage modules?
We have our Terraform state files and modules in a single repository, and use GitHub’s CODEOWNERS functionality to assign reviews of a given state file to the relevant team.
For our modules, we used to use the relative directory path as the module source.
module "service_dev" {
  source      = "../../../modules/slack/service"
  whitecastle = true
  vpc_id      = local.vpc_id
  subnets     = local.public_subnets
  pvt_subnets = local.private_subnets
}
Even though this approach makes testing module changes very easy, it is also quite risky, as a change can break other state files using the same module.
We also looked at using the GitHub source path approach.
module "network" {
  source                      = "git::git@github.com:slack/repo-name//module_path?ref=COMMIT_HASH"
  network_cidr_ranges         = var.network_cidr_ranges
  private_subnets_cidr_blocks = var.private_subnets_cidr_blocks
  public_subnets_cidr_blocks  = var.public_subnets_cidr_blocks
}
With this approach, we were able to pin a state file to a specific version of a module; however, there was a big drawback. Every Terraform plan/apply has to clone the entire repository (remember, we have all our Terraform code in a single repository), and this was very time consuming. Also, Git hashes are not very friendly to read and compare.
Our own module catalog
We wanted a better and simpler way to manage our modules, and therefore developed some internal tooling to handle this. Now we have a pipeline that gets triggered every time a change is merged to the Terraform repository. This pipeline checks for any module changes, and if it finds one, it creates a new tarball of the module and uploads it to an S3 bucket with a new version.
agunasekara@ops-box:~ >> aws s3 ls s3://terraform-registry-bucket-name/terraform/modules/aws/vpc/
2021-12-20 17:04:35 5777 0.0.1.tgz
2021-12-23 12:08:23 5778 0.0.2.tgz
2022-01-10 16:00:13 5754 0.0.3.tgz
2022-01-12 14:32:54 5756 0.0.4.tgz
2022-01-19 20:34:16 5755 0.0.5.tgz
2022-06-01 05:16:03 5756 0.0.6.tgz
2022-06-01 05:34:27 5756 0.0.7.tgz
2022-06-01 19:38:21 5756 0.0.8.tgz
2022-06-27 07:47:21 5756 0.0.9.tgz
2022-09-07 18:54:53 5754 0.1.0.tgz
2022-09-07 18:54:54 2348 versions.json
It also uploads a file called `versions.json`, which contains the version history for a given module.
agunasekara@ops-box:~ >> jq < versions.json
{
  "name": "aws/vpc",
  "path": "terraform/modules/aws/vpc",
  "latest": "0.1.0",
  "history": [
    {
      "commithash": "xxxxxxxxxxxxxxxxxxxxxxxxx",
      "signature": {
        "Name": "Archie Gunasekara",
        "Email": "[email protected]",
        "When": "2021-12-21T12:04:08+11:00"
      },
      "version": "0.0.1"
    },
    {
      "commithash": "xxxxxxxxxxxxxxxxxxxxxxxxx",
      "signature": {
        "Name": "Archie Gunasekara",
        "Email": "[email protected]",
        "When": "2022-09-08T11:26:17+10:00"
      },
      "version": "0.1.0"
    }
  ]
}
We also built a tool called `tf-module-viewer` that makes it easy for teams to list the versions of a module.
agunasekara@ops-box:~ >> tf-module-viewer module-catalogue
Search: █
? Select a Module:
aws/alb
aws/ami
aws/aurora
↓ aws/autoscaling
With this new module catalog approach, we can now pin our modules using a `vendored_modules` path, and our Terraform binary wrappers copy these modules from the catalog S3 path when `terraform init` is run.
module "service_dev" {
  source      = "../../../vendored_modules/slack/service"
  whitecastle = true
  vpc_id      = local.vpc_id
  subnets     = []
  pvt_subnets = local.private_subnets
}
The Terraform binary wrapper reads the required version of each module from a configuration file. It then downloads the required versions of the modules to the `vendored_modules` path before running `terraform init`.
modules:
  aws/alb: 0.1.0
  aws/ami: 0.1.9
  aws/aurora: 0.0.6
  aws/eip: 0.0.7
Is this perfect? No…
Not all our state files use this approach. It’s only implemented for the ones with tight compliance requirements. The other state files still reference modules directly using a relative path within the repository. In addition, the module catalog approach makes it harder to test changes quickly, as a module change must first be made and uploaded to the catalog before a state file can reference it.
Terraform modules can have multiple outputs and conditional resources and configurations. When a module is updated and uploaded to the catalog, a tool called Terraform Smart Planner (we’ll talk more about this later) prompts the user to test all state files that use this module by unpinning it. However, this is not enforced, and a change may break certain state files while working fine for others. Users of the module would not find out about these issues until they update the pinned module version to the latest. Even though the rollback is as easy as reverting to an earlier version, this is still an inconvenience, and a patched version of the module would then have to be uploaded to the catalog before the newer version could be used.
How do we build pipelines for our Terraform state files?
As I mentioned above, we use Jenkins for our Terraform deployments. With hundreds of state files, we have hundreds of pipelines and stages. We use an in-house Groovy library with the Jenkins Job DSL plugin to create these pipelines. When a team is building a new state file, they either create a brand new pipeline or add it as a stage to an existing pipeline. This was done by adding a DSL script to a directory that Jenkins reads on a schedule to build all our pipelines. However, this is not a very user-friendly experience, as writing Groovy to build a new pipeline is time consuming and error prone.
However, an awesome engineer on my team named Andrew Martin used his innovation day to solve this problem. He built a small program that reads a simple YAML file and generates the complex DSL scripts that Jenkins uses to build its pipelines. This new approach made creating new Terraform state pipelines a breeze.
pipelinename: Terraform-Deployment-rosi-org-root
steps:
  - path: accnt-rosi-org-root/env-global
  - path: accnt-rosi-org-root/us-east-1
  - path: accnt-rosi-org-root/env-global/organizational-units-and-scps/sandbox
    next:
      - path: accnt-rosi-org-root/env-global/organizational-units-and-scps/dev
        next:
          - path: accnt-rosi-org-root/env-global/organizational-units-and-scps/prod-staging
            next:
              - path: accnt-rosi-org-root/env-global/organizational-units-and-scps/prod
The configuration above creates the corresponding multi-stage pipeline in Jenkins.
How do we test Terraform changes?
As I mentioned before, every Terraform state file has a plan stage in its pipeline. This stage must complete successfully before proceeding to the apply stage. However, for this to happen, changes must already be merged to the master branch of the Terraform repository, and unfortunately, if a bad change gets merged, the pipeline is broken and blocked. Also, if a bad change to a widely used module gets merged, multiple state files may be impacted.
To fix this, we introduced a tool called Terraform Smart Planner. Once a change is made to the Terraform repository, we can execute this tool. Terraform Smart Planner finds all state files impacted by the change, runs plans against each one, and posts the output to the pull request. Terraform Smart Planner works in a similar way when a module is updated; as discussed earlier, it will also prompt the user to unpin any modules pinned via the module catalog if those modules have changes made to them.
Having this output in the pull request body is incredibly helpful for reviewers, as they can see which resources are impacted by a given change. It also helps to uncover any indirectly impacted state files and any changes to the resources in them. This allows us to confidently approve a pull request or request further changes.
We also run a similar CI check for every pull request and block merges of anything with broken Terraform plans.
Some closing words
Our Terraform usage is far from perfect, and there are many improvements we can make to the user experience. The Cloud Foundations team is working closely with service teams across Slack to collect feedback and improve the processes and tools that manage our infrastructure. We have also written our own Terraform providers to manage our unique services while making contributions to open-source providers. There is a lot of exciting work happening in this space right now, and if this sounds like something you’d be interested in, please keep an eye on our careers page.