The idea of "state" is the lynchpin of Terraform, and yet Terraform's workflow is fraught with gotchas that can lead to the loss or destruction of state. This doc is a set of notes about issues I've encountered, what caused them, and in many cases ideas for how Terraform could be improved to avoid them or reduce their likelihood.
Each of these scenarios has occurred at least once within my team. Each time one of these occurs it erodes people's confidence in Terraform, giving it a reputation for being fragile and unforgiving of errors. This document is not written just to criticize, but rather to identify ways in which the situation could be improved.
This one is not strictly related to Terraform State itself, but has implications for state integrity and can be solved in terms of it.
When running `terraform plan -out=tfplan`, a `tfplan` file is created with a serialized version of the created plan. This plan can then be applied with `terraform apply tfplan`.

Once applied, the `tfplan` file is left in the local directory and can potentially be accidentally applied again. For many changes this results in an error, but in some cases it results in duplicated resources, where only the new resources are actually tracked in the state.
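For illustration, the hazard looks something like this (a minimal sketch; the file name `tfplan` and the repeated apply are purely illustrative):

```
terraform plan -out=tfplan    # writes the serialized plan to ./tfplan
terraform apply tfplan        # applies it; tfplan is left behind on disk

# ...later, perhaps from a different shell or an automated job...
terraform apply tfplan        # re-applies the now-stale plan, potentially duplicating resources
```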
One particular case where Terraform encourages mistakes is that when `plan` produces an empty diff, the `tfplan` file is not updated to reflect that empty diff, leaving behind the result of some previous plan. However, since Terraform exited successfully, the user (or some automated system looking at the exit status) is often tempted to run `terraform apply tfplan` anyway, at which point the stale plan is re-applied.
This gotcha could be addressed by the following improvements:
- When writing out a plan file, include in the plan the serial number of the state payload that it was derived from. Before applying the plan, verify that the current serial matches what's in the plan and fail with an error if not. (A rough approximation of this check, done outside of Terraform, is sketched after this list.)
- When `plan` produces an empty diff and the `-out` argument is provided, write the empty diff out to the given file so that a subsequent `terraform apply` on that file will be a no-op.
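The first of these could be roughly approximated outside of Terraform today with a small wrapper, assuming the local state lives in `terraform.tfstate` and relying on the top-level `serial` field that Terraform already increments on each state change (the wrapper, the use of `jq`, and the file names are all illustrative, not part of Terraform):

```
#!/bin/sh
# Sketch: refuse to apply a plan made from an older state serial.
set -e

terraform plan -out=tfplan
jq '.serial' terraform.tfstate > tfplan.serial    # remember which state the plan was derived from

# ...review the plan here...

if [ "$(jq '.serial' terraform.tfstate)" != "$(cat tfplan.serial)" ]; then
  echo "state has changed since this plan was created; re-run terraform plan" >&2
  exit 1
fi

terraform apply tfplan
rm -f tfplan tfplan.serial    # don't leave a stale plan lying around
```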
When running `terraform remote config` in a directory that already has a state file present, Terraform will try to upload the current state to the newly-configured location.
If some data was already present at the new location, this data is unconditionally overwritten. If the existing data happens to be another Terraform state, that state may then be lost.
This is particularly troublesome for configurations that are intended to be deployed multiple times with different variables: one must be very careful when switching between the states for different instances of the configuration to avoid replacing one instance's state with another.
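For example, with an S3 remote (the bucket and key names here are made up, and the exact flags may vary between Terraform versions), switching between two instances of the same configuration means re-running `terraform remote config` with a different key, and getting the order of operations wrong pushes one instance's state over the other's:

```
# point at instance A's state
terraform remote config -backend=s3 \
  -backend-config="bucket=example-tfstate" \
  -backend-config="key=app/instance-a/terraform.tfstate"

# later, switching to instance B: if A's state is still present locally,
# this uploads it over whatever was stored for instance B
terraform remote config -backend=s3 \
  -backend-config="bucket=example-tfstate" \
  -backend-config="key=app/instance-b/terraform.tfstate"
```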
The core issue of accidentally replacing an existing state could be addressed by:
- Making each fresh state file contain a "lineage" property that is unique for each fresh state. (#4389)
- Making `terraform remote config` first try to `Read` the configured location and, if it gets a non-error response, ensure that the retrieved data is a valid Terraform state of the same lineage as what is being written.
- For extra safety: fail also if the already-stored remote state has a serial greater than the local serial.
Making this check only during `terraform remote config` would not comprehensively deal with all situations of accidentally downgrading a state, but it would catch some mistakes, and there's little legitimate reason to actually downgrade a state.
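To make that concrete: state files already carry a `serial` that increments on each change, and #4389 would add a `lineage` identifier alongside it, so the relevant part of a state file might look roughly like the fragment below (the `lineage` value is invented, and the rest of the state is omitted). The proposed rule would be to refuse the upload unless the stored state's lineage matches the local one and its serial is not greater than the local serial.

```
{
  "serial": 7,
  "lineage": "6a2f0b44-0cfa-4a41-9b6b-1a9c2f3d4e5f"
}
```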
For any project using remote state it's important to always run `terraform remote config` to set up the remote state before taking any other actions that interact with the state. However, it's easy to forget to do this.
If this is forgotten then running `terraform apply` will likely produce a duplicate set of resources due to the absence of a local state. If the operator panics and then tries to run `terraform remote config` after this, rather than destroying the erroneously-created resources directly, the previous issue causes the "true" state to be overwritten by the new state.
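The required ordering, for reference (a sketch, with illustrative backend flags):

```
# in a fresh working directory with no local state:
terraform remote config -backend=s3 \
  -backend-config="bucket=example-tfstate" \
  -backend-config="key=app/terraform.tfstate"    # fetches the existing remote state first

terraform plan -out=tfplan                       # now plans against the real state
terraform apply tfplan
```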
This could be addressed by:
- In the very short term, a mechanism in the Terraform configuration to indicate that remote state is required so that Terraform can refuse to run if it's not configured.
- In the longer term, allowing a specific remote configuration to be provided within the configuration, using variable interpolations to accommodate configurations that produce multiple instances depending on arguments. (#1964; a hypothetical sketch of what this might look like follows below.)
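Purely as an illustration of that second idea, and not an existing or agreed-upon syntax, such a block might look something like this:

```
# Hypothetical syntax sketching #1964; none of this exists today.
remote_state {
  backend = "s3"

  config {
    bucket = "example-tfstate"
    key    = "app/${var.region}/terraform.tfstate"
  }
}
```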
Consider the following configuration:
variable "region" {}
provider "aws" {
region = "${var.region}"
}
resource "aws_instance" "main" {
// ....
}
When this is planned the user might run `terraform plan -var="region=us-west-2"` to deploy the app to `us-west-2`, and then use `us-west-1` with a separate state to deploy the same instance in that region.
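For example, using the `-state` argument to keep one state file per region (the file names are arbitrary):

```
terraform plan  -var="region=us-west-2" -state=us-west-2.tfstate -out=us-west-2.tfplan
terraform apply -state=us-west-2.tfstate us-west-2.tfplan

terraform plan  -var="region=us-west-1" -state=us-west-1.tfstate -out=us-west-1.tfplan
terraform apply -state=us-west-1.tfstate us-west-1.tfplan
```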
In this scenario the user must be very careful to keep the state selection aligned with the `region` variable. If `plan` is run with the region set to `us-west-2` but with the state for the `us-west-1` deployment, the "Refresh" phase will look up the AWS instance in the wrong region, see that it doesn't exist, and remove it from the state before generating a diff to replace it.
The only way to recover from this is to manually revert to an earlier version of the state that had the resource instance still listed.
This one is tough to address due to Terraform's architecture but here are some ideas:
- Allow providers to mark some attributes as "resource identity attributes", and require some sort of manual resolution when they change.
- Have the AWS provider in particular remember the region that each resource was created in, and only pay attention to the provider-specified region during `Create`, with `Read`, `Update` and `Delete` using the resource-recorded region. In this case moving a resource to another region would require tainting it, as sketched after this list.
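Under that second idea, moving the instance to a different region would become an explicit operation, along these lines (a sketch):

```
terraform taint aws_instance.main                       # force the instance to be re-created
terraform plan -var="region=us-west-1" -out=tfplan      # the replacement is created in the newly-specified region
terraform apply tfplan
```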
Terraform supports running just `terraform apply` as a shorthand for `terraform plan -out=tfplan && terraform apply tfplan`.
This combined operation is handy when you're new to Terraform and you want to experiment, but it's generally a bad idea to do this on any real production deployment, since you don't get a chance to review what changes will be made and so you can end up inadvertently destroying important infrastructure.
I've observed people not quite understanding the flow and doing this:
```
terraform plan -out=tfplan
terraform apply
```
This appears to work and so people don't realize it's wrong, but then one day they end up applying something slightly different than what was planned.
There is a particularly awkward variation on this whose consequences are worse:
```
terraform plan -out=tfplan -target=something.something
terraform apply
```
Here the user wanted to apply just a subset of the config, but inadvertently ended up applying all of it.
There isn't any good way for Terraform to recognize and block this mistake automatically, since the plan file can be called anything and might be stale.
However, we could allow a new top-level setting in the Terraform config that allows the config author to express that this config must always be planned separately from apply:
```
workflow {
  require_explicit_plan = true
}
```
When this flag is set, running `terraform apply` without a plan file would generate an error:
```
$ terraform apply
This configuration requires an explicit separate plan step.

To create a plan, run:
    terraform plan -out=tfplan

Once you've reviewed the plan and verified that it will act as expected,
you can then apply it using:
    terraform apply tfplan
```