Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud: Allow Terraform Cloud/Enterprise to apply backpressure to intermediate state snapshots #32983

Merged
merged 4 commits into from
May 23, 2023

Conversation

apparentlymart
Copy link
Contributor

@apparentlymart apparentlymart commented Apr 5, 2023

NOTE: The scope of this PR changed during discussion; see the commentary below to learn about the expanded scope.

The remaining text is the original PR writeup from before the scope increase.


We've seen some concern about the additional storage usage in Terraform Enterprise implied by creating intermediate state snapshots for particularly long apply phases that can arise when managing a large number of resource instances together in a single workspace.

This is an initial coarse approach to solving that concern, just restoring the original behavior when running inside Terraform Cloud or Enterprise for now and not creating intermediate snapshots at all.

This is here as a solution of last resort in case we do not find a better compromise before the v1.5.0 final release. Hopefully a future changeset will implement a more subtle take on this which still gets some of the benefits when running in a Terraform Enterprise environment but in a way that will hopefully be less concerning for Terraform Enterprise administrators, in which case the effects of this PR will not make it into any stable release.

This does not affect any other state storage implementation except the Terraform Cloud integration and the "remote" backend's state storage when running inside a TFC/TFE-driven remote execution environment.

@apparentlymart apparentlymart added enhancement cloud Related to Terraform Cloud's integration with Terraform labels Apr 5, 2023
@apparentlymart apparentlymart self-assigned this Apr 5, 2023
@apparentlymart
Copy link
Contributor Author

The two additional commits just added change the rule for the Terraform Cloud integration so that it will still persist state snapshots like any other state storage by default, but will allow the server to apply dynamic backpressure by setting a special HTTP header in its response to Create New State Version.

If the server requests an interval shorter than the default one (currently 20 seconds) then the default one will take priority. If the server requests a longer interval then that will take priority, saturating at 1 hour because a delay of longer than that isn't very reasonable and probably suggests a bug.

This doesn't change the rules for the "remote" backend because using that backend for Terraform Cloud is now deprecated anyway, and for newer versions of Terraform CLI the remote backend would never be used in a remote run in Terraform Cloud anyway. For older versions of Terraform CLI that did require the remote backend for Terraform Cloud, they would not yet have had the possibility of persisting intermediate snapshots.

@apparentlymart
Copy link
Contributor Author

apparentlymart commented Apr 25, 2023

For the moment this is depending on an unreleased commit from github.com/hashicorp/go-tfe. We'll need to merge hashicorp/go-tfe#689 before moving forward with this, then make a new go-tfe release, and then finally change go.mod and go.sum in this PR to refer to the real release instead.

I'll leave this in draft status until we have a real go-tfe release to depend on, but otherwise I'd love for folks to take a look at this and flag any concerns so that I can respond to them concurrently with the review and release process on go-tfe.

I intend this to be complementary to rather than instead of other efforts to improve Terraform Cloud's handling of intermediate state snapshots. This change is focused on giving Terraform Cloud/Enterprise a way to protect themselves from a barrage of incoming requests, but doesn't do anything to address other concerns such as the unfortunate way Terraform Cloud presents multiple state versions per run in its UI today.

@brandonc
Copy link
Contributor

@apparentlymart One thing that jumps out at me is that the platform assumes that the last created state version is the current workspace state version (and each get parsed for outputs and workspace resources, which become workspace artifacts)

It feels like we need the Persister API to give us the hint that the state is finalized or still pending.

@apparentlymart
Copy link
Contributor Author

apparentlymart commented Apr 26, 2023

Hi @brandonc,

I'm not sure I see a situation where the server-side state storage should try to distinguish between an intermediate state snapshot and a final one.

What makes a final snapshot final is only that it was the most recent one. If Terraform CLI writes an intermediate snapshot and then crashes, that intermediate snapshot is the final snapshot for that run, and so I don't think it should be treated any differently than the final snapshot created after a run successfully completes. In either case the latest snapshot is the most accurate description we have of the infrastructure that exists, and so should always replace any snapshots created by earlier runs.

Does that make sense to you too, or am I missing something?

@brandonc
Copy link
Contributor

@apparentlymart It does make sense, yes. My concern, though, is that we had aimed at not engaging certain Cloud services (state parsing mainly, which produce resource and output artifacts) until the last snapshot in the operation was completed. I suppose we could engage that stuff after x-terraform-snapshot-interval + timeout seconds after the last persistence... But it would be nice to be able to get a hint that the operation was completed so we could process the state sooner.

@apparentlymart
Copy link
Contributor Author

apparentlymart commented Apr 26, 2023

Hmm, interesting!

I suppose I had assumed that the completion of the run in Terraform Cloud would be the trigger for that sort of thing.

By which I mean:

  • Whenever the "apply" step for a run reaches a terminal status -- whether success or error -- trigger whatever analysis needs to happen against the most recently-written snapshot. Terraform Cloud owns that state machine, so it should not need Terraform CLI's help to detect that event.
  • If a "Create State Version" requests arrives from outside of the context of a Terraform Cloud run, then we have no "framing context" for collecting multiple state snapshots together, so we'll need to assume all such snapshots are final and trigger the analysis immediately.

It seems to me that we need to make sure that we always eventually do the async analysis, even if Terraform CLI crashes in the middle of a run, because otherwise that would leave the Terraform Cloud metadata out of sync with the real latest state snapshot, right?

So what I described above is just an optimization to allow us to delay doing it when we're in control of the execution environment, because in that case we can be guaranteed that even if Terraform CLI or the Terraform Cloud Agent crashes the platform will still have an opportunity to run some final steps once the apply step is complete, and therefore we don't need to be so conservative in processing every intermediate snapshot just in case the CLI or agent crashes.

@sebasslash
Copy link
Contributor

Correct, the tricky issue here is workspaces in local execution mode. As you stated, TFC has no "framing context" in this case so I was trying to figure out a way we could minimize the number of statefiles we'd need to store & process. It seems one potential way to do it is only process the latest state version (one created outside of a TFC run) when a workspace is unlocked. Although I feel like this might be "too late" and be a detriment to the local execution experience.

It would've really simplified things if the StateManager had context about the status of the current apply operation that was modifying state. That way we have a source of truth in order to tell TFC "hey we aren't going to create any more state versions"

@apparentlymart
Copy link
Contributor Author

apparentlymart commented Apr 26, 2023

I agree that it would be better to avoid processing the intermediate snapshots when possible, but from what I can see it doesn't seem to be possible in all cases: if Terraform CLI crashes before it produces the final snapshot then we need to somehow retroactively decide that the most recent intermediate snapshot was the final one, because otherwise the system will become inconsistent.

I hadn't previously considered using the locking mechanism for the "run framing", but that does seem like an alternative worth investigating: even in local mode Terraform CLI will hold the workspace lock throughout the apply phase, so perhaps the rule would be that any new snapshots created while the lock is held don't trigger any analysis, but then once the lock is released we then immediately trigger the analysis of whatever is the latest snapshot at that point, if the latest snapshot wasn't already analyzed. Unless I'm missing something, that would achieve the needed behavior for both local and remote runs, because both of them involve the workspace lock.

(This does mean that if Terraform CLI crashes while holding the lock for a local operation then the workspace will stay locked and the analysis won't immediately run, but presumably it won't be too long before someone manually unlocks the workspace to continue working, and the analysis would begin at that point.)

@apparentlymart apparentlymart changed the title cloud: Skip intermediate state snapshots in Terraform Cloud/Enterprise cloud: Allow Terraform Cloud/Enterprise to apply backpressure to intermediate state snapshots May 9, 2023
@apparentlymart apparentlymart force-pushed the f-cloud-no-intermediate-state-snapshots branch from 6f696c3 to 12989f0 Compare May 9, 2023 16:53
@apparentlymart
Copy link
Contributor Author

I've rebased this to include the merged version of the go-tfe change, but that isn't included in a go-tfe release yet so I cannot finalize this PR to refer to a released version.

@alisdair
Copy link
Contributor

@apparentlymart apparentlymart added the 1.5-backport If you add this label to a PR before merging, backport-assistant will open a new PR once merged label May 15, 2023
@apparentlymart apparentlymart force-pushed the f-cloud-no-intermediate-state-snapshots branch from 12989f0 to b0e91af Compare May 15, 2023 20:50
@apparentlymart apparentlymart marked this pull request as ready for review May 15, 2023 20:59
Previously we just always used the same intermediate state persistence
behavior for all state storages. However, some storages might have access
to additional information that allows them to tailor when they persist,
such as reacting to API rate limit status headers in responses, or just
knowing that a particular storage isn't suited to intermediate snapshots
at all for some reason.

This commit doesn't actually change any observable behavior yet, but it
introduces an optional means for a state storage to customize the behavior
which we may make use of in certain storage implementations in future
commits.
We've seen some concern about the additional storage usage implied by
creating intermediate state snapshots for particularly long apply phases
that can arise when managing a large number of resource instances together
in a single workspace.

This is an initial coarse approach to solving that concern, just restoring
the original behavior when running inside Terraform Cloud or Enterprise
for now and not creating snapshots at all.

This is here as a solution of last resort in case we cannot find a better
compromise before the v1.5.0 final release. Hopefully a future commit
will implement a more subtle take on this which still gets some of the
benefits when running in a Terraform Enterprise environment but in a way
that will hopefully be less concerning for Terraform Enterprise
administrators.

This does not affect any other state storage implementation except the
Terraform Cloud integration and the "remote" backend's state storage when
running inside a TFC/TFE-driven remote execution environment.
Previously we just made a hard rule that the state storage for Terraform
Cloud would never save any intermediate snapshots at all, as a coarse way
to mitigate concerns over heightened Terraform Enterprise storage caused
by saving intermediate snapshots.

As a better compromise, we'll now create intermediate snapshots at the
default interval unless the Terraform Cloud API responds with a special
extra header field X-Terraform-Snapshot-Interval, which specifies a
different number of seconds (up to 1 hour) to wait before saving the next
snapshot.

This will then allow Terraform Cloud and Enterprise to provide some dynamic
backpressure when needed, either to reduce the disk usage in Terraform
Enterprise or in situations where Terraform Cloud is under unusual load
and needs to calm the periodic intermediate snapshot writes from clients.

This respects the "force persist" mode so that if Terraform CLI is
interrupted with SIGINT then it'll still be able to urgently persist
a snapshot of whatever state it currently has, in anticipation of probably
being terminated with a more aggressive signal very soon.
@apparentlymart apparentlymart force-pushed the f-cloud-no-intermediate-state-snapshots branch from b0e91af to 3e8a6dd Compare May 23, 2023 17:21
Copy link
Contributor

@sebasslash sebasslash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 I'll go ahead and add further refinements on toggling the setting in a new PR

@apparentlymart apparentlymart merged commit f6737d4 into main May 23, 2023
@github-actions
Copy link

Reminder for the merging maintainer: if this is a user-visible change, please update the changelog on the appropriate release branch.

@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
1.5-backport If you add this label to a PR before merging, backport-assistant will open a new PR once merged cloud Related to Terraform Cloud's integration with Terraform enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants