cloud: Allow Terraform Cloud/Enterprise to apply backpressure to intermediate state snapshots #32983

apparentlymart · 2023-04-05T00:31:40Z

NOTE: The scope of this PR changed during discussion; see the commentary below to learn about the expanded scope.

The remaining text is the original PR writeup from before the scope increase.

We've seen some concern about the additional storage usage in Terraform Enterprise implied by creating intermediate state snapshots for particularly long apply phases that can arise when managing a large number of resource instances together in a single workspace.

This is an initial coarse approach to solving that concern, just restoring the original behavior when running inside Terraform Cloud or Enterprise for now and not creating intermediate snapshots at all.

This is here as a solution of last resort in case we do not find a better compromise before the v1.5.0 final release. Hopefully a future changeset will implement a more subtle take on this which still gets some of the benefits when running in a Terraform Enterprise environment but in a way that will hopefully be less concerning for Terraform Enterprise administrators, in which case the effects of this PR will not make it into any stable release.

This does not affect any other state storage implementation except the Terraform Cloud integration and the "remote" backend's state storage when running inside a TFC/TFE-driven remote execution environment.

apparentlymart · 2023-04-25T22:02:46Z

The two additional commits just added change the rule for the Terraform Cloud integration so that it will still persist state snapshots like any other state storage by default, but will allow the server to apply dynamic backpressure by setting a special HTTP header in its response to Create New State Version.

If the server requests an interval shorter than the default one (currently 20 seconds) then the default one will take priority. If the server requests a longer interval then that will take priority, saturating at 1 hour because a delay of longer than that isn't very reasonable and probably suggests a bug.

This doesn't change the rules for the "remote" backend because using that backend for Terraform Cloud is now deprecated anyway, and for newer versions of Terraform CLI the remote backend would never be used in a remote run in Terraform Cloud anyway. For older versions of Terraform CLI that did require the remote backend for Terraform Cloud, they would not yet have had the possibility of persisting intermediate snapshots.

apparentlymart · 2023-04-25T22:04:37Z

For the moment this is depending on an unreleased commit from github.com/hashicorp/go-tfe. We'll need to merge hashicorp/go-tfe#689 before moving forward with this, then make a new go-tfe release, and then finally change go.mod and go.sum in this PR to refer to the real release instead.

I'll leave this in draft status until we have a real go-tfe release to depend on, but otherwise I'd love for folks to take a look at this and flag any concerns so that I can respond to them concurrently with the review and release process on go-tfe.

I intend this to be complementary to rather than instead of other efforts to improve Terraform Cloud's handling of intermediate state snapshots. This change is focused on giving Terraform Cloud/Enterprise a way to protect themselves from a barrage of incoming requests, but doesn't do anything to address other concerns such as the unfortunate way Terraform Cloud presents multiple state versions per run in its UI today.

brandonc · 2023-04-26T17:59:50Z

@apparentlymart One thing that jumps out at me is that the platform assumes that the last created state version is the current workspace state version (and each get parsed for outputs and workspace resources, which become workspace artifacts)

It feels like we need the Persister API to give us the hint that the state is finalized or still pending.

apparentlymart · 2023-04-26T18:06:12Z

Hi @brandonc,

I'm not sure I see a situation where the server-side state storage should try to distinguish between an intermediate state snapshot and a final one.

What makes a final snapshot final is only that it was the most recent one. If Terraform CLI writes an intermediate snapshot and then crashes, that intermediate snapshot is the final snapshot for that run, and so I don't think it should be treated any differently than the final snapshot created after a run successfully completes. In either case the latest snapshot is the most accurate description we have of the infrastructure that exists, and so should always replace any snapshots created by earlier runs.

Does that make sense to you too, or am I missing something?

brandonc · 2023-04-26T18:34:55Z

@apparentlymart It does make sense, yes. My concern, though, is that we had aimed at not engaging certain Cloud services (state parsing mainly, which produce resource and output artifacts) until the last snapshot in the operation was completed. I suppose we could engage that stuff after x-terraform-snapshot-interval + timeout seconds after the last persistence... But it would be nice to be able to get a hint that the operation was completed so we could process the state sooner.

apparentlymart · 2023-04-26T19:07:16Z

Hmm, interesting!

I suppose I had assumed that the completion of the run in Terraform Cloud would be the trigger for that sort of thing.

By which I mean:

Whenever the "apply" step for a run reaches a terminal status -- whether success or error -- trigger whatever analysis needs to happen against the most recently-written snapshot. Terraform Cloud owns that state machine, so it should not need Terraform CLI's help to detect that event.
If a "Create State Version" requests arrives from outside of the context of a Terraform Cloud run, then we have no "framing context" for collecting multiple state snapshots together, so we'll need to assume all such snapshots are final and trigger the analysis immediately.

It seems to me that we need to make sure that we always eventually do the async analysis, even if Terraform CLI crashes in the middle of a run, because otherwise that would leave the Terraform Cloud metadata out of sync with the real latest state snapshot, right?

So what I described above is just an optimization to allow us to delay doing it when we're in control of the execution environment, because in that case we can be guaranteed that even if Terraform CLI or the Terraform Cloud Agent crashes the platform will still have an opportunity to run some final steps once the apply step is complete, and therefore we don't need to be so conservative in processing every intermediate snapshot just in case the CLI or agent crashes.

sebasslash · 2023-04-26T19:21:57Z

Correct, the tricky issue here is workspaces in local execution mode. As you stated, TFC has no "framing context" in this case so I was trying to figure out a way we could minimize the number of statefiles we'd need to store & process. It seems one potential way to do it is only process the latest state version (one created outside of a TFC run) when a workspace is unlocked. Although I feel like this might be "too late" and be a detriment to the local execution experience.

It would've really simplified things if the StateManager had context about the status of the current apply operation that was modifying state. That way we have a source of truth in order to tell TFC "hey we aren't going to create any more state versions"

apparentlymart · 2023-04-26T20:19:23Z

I agree that it would be better to avoid processing the intermediate snapshots when possible, but from what I can see it doesn't seem to be possible in all cases: if Terraform CLI crashes before it produces the final snapshot then we need to somehow retroactively decide that the most recent intermediate snapshot was the final one, because otherwise the system will become inconsistent.

I hadn't previously considered using the locking mechanism for the "run framing", but that does seem like an alternative worth investigating: even in local mode Terraform CLI will hold the workspace lock throughout the apply phase, so perhaps the rule would be that any new snapshots created while the lock is held don't trigger any analysis, but then once the lock is released we then immediately trigger the analysis of whatever is the latest snapshot at that point, if the latest snapshot wasn't already analyzed. Unless I'm missing something, that would achieve the needed behavior for both local and remote runs, because both of them involve the workspace lock.

(This does mean that if Terraform CLI crashes while holding the lock for a local operation then the workspace will stay locked and the analysis won't immediately run, but presumably it won't be too long before someone manually unlocks the workspace to continue working, and the analysis would begin at that point.)

apparentlymart · 2023-05-09T16:54:56Z

I've rebased this to include the merged version of the go-tfe change, but that isn't included in a go-tfe release yet so I cannot finalize this PR to refer to a released version.

alisdair · 2023-05-15T17:11:13Z

go-tfe v1.24.0 was just released.

Previously we just always used the same intermediate state persistence behavior for all state storages. However, some storages might have access to additional information that allows them to tailor when they persist, such as reacting to API rate limit status headers in responses, or just knowing that a particular storage isn't suited to intermediate snapshots at all for some reason. This commit doesn't actually change any observable behavior yet, but it introduces an optional means for a state storage to customize the behavior which we may make use of in certain storage implementations in future commits.

We've seen some concern about the additional storage usage implied by creating intermediate state snapshots for particularly long apply phases that can arise when managing a large number of resource instances together in a single workspace. This is an initial coarse approach to solving that concern, just restoring the original behavior when running inside Terraform Cloud or Enterprise for now and not creating snapshots at all. This is here as a solution of last resort in case we cannot find a better compromise before the v1.5.0 final release. Hopefully a future commit will implement a more subtle take on this which still gets some of the benefits when running in a Terraform Enterprise environment but in a way that will hopefully be less concerning for Terraform Enterprise administrators. This does not affect any other state storage implementation except the Terraform Cloud integration and the "remote" backend's state storage when running inside a TFC/TFE-driven remote execution environment.

Previously we just made a hard rule that the state storage for Terraform Cloud would never save any intermediate snapshots at all, as a coarse way to mitigate concerns over heightened Terraform Enterprise storage caused by saving intermediate snapshots. As a better compromise, we'll now create intermediate snapshots at the default interval unless the Terraform Cloud API responds with a special extra header field X-Terraform-Snapshot-Interval, which specifies a different number of seconds (up to 1 hour) to wait before saving the next snapshot. This will then allow Terraform Cloud and Enterprise to provide some dynamic backpressure when needed, either to reduce the disk usage in Terraform Enterprise or in situations where Terraform Cloud is under unusual load and needs to calm the periodic intermediate snapshot writes from clients. This respects the "force persist" mode so that if Terraform CLI is interrupted with SIGINT then it'll still be able to urgently persist a snapshot of whatever state it currently has, in anticipation of probably being terminated with a more aggressive signal very soon.

sebasslash

🚀 I'll go ahead and add further refinements on toggling the setting in a new PR

github-actions · 2023-05-23T22:26:01Z

Reminder for the merging maintainer: if this is a user-visible change, please update the changelog on the appropriate release branch.

github-actions · 2023-06-23T02:19:33Z

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

apparentlymart added enhancement cloud Related to Terraform Cloud's integration with Terraform labels Apr 5, 2023

apparentlymart self-assigned this Apr 5, 2023

vercel bot deployed to Preview April 5, 2023 00:33 View deployment

apparentlymart mentioned this pull request Apr 25, 2023

Response header hooks hashicorp/go-tfe#689

Merged

vercel bot deployed to Preview April 25, 2023 22:03 View deployment

apparentlymart requested review from a team April 25, 2023 22:04

apparentlymart force-pushed the f-cloud-no-intermediate-state-snapshots branch from 5928870 to 6f696c3 Compare April 26, 2023 17:40

vercel bot deployed to Preview April 26, 2023 17:43 View deployment

apparentlymart changed the title ~~cloud: Skip intermediate state snapshots in Terraform Cloud/Enterprise~~ cloud: Allow Terraform Cloud/Enterprise to apply backpressure to intermediate state snapshots May 9, 2023

apparentlymart force-pushed the f-cloud-no-intermediate-state-snapshots branch from 6f696c3 to 12989f0 Compare May 9, 2023 16:53

apparentlymart added this to the v1.5 milestone May 9, 2023

vercel bot deployed to Preview May 9, 2023 16:58 View deployment

apparentlymart added the 1.5-backport If you add this label to a PR before merging, backport-assistant will open a new PR once merged label May 15, 2023

apparentlymart force-pushed the f-cloud-no-intermediate-state-snapshots branch from 12989f0 to b0e91af Compare May 15, 2023 20:50

vercel bot deployed to Preview May 15, 2023 20:54 View deployment

apparentlymart marked this pull request as ready for review May 15, 2023 20:59

apparentlymart added 2 commits May 23, 2023 10:08

go.mod: upgrade github.com/hashicorp/go-tfe for response header hooks

e71fe91

apparentlymart force-pushed the f-cloud-no-intermediate-state-snapshots branch from b0e91af to 3e8a6dd Compare May 23, 2023 17:21

alisdair approved these changes May 23, 2023

View reviewed changes

vercel bot deployed to Preview May 23, 2023 17:25 View deployment

sebasslash approved these changes May 23, 2023

View reviewed changes

apparentlymart merged commit f6737d4 into main May 23, 2023

teamterraform mentioned this pull request May 23, 2023

Backport of cloud: Allow Terraform Cloud/Enterprise to apply backpressure to intermediate state snapshots into v1.5 #33251

Merged

apparentlymart deleted the f-cloud-no-intermediate-state-snapshots branch May 23, 2023 22:40

apparentlymart removed this from the v1.5 milestone May 23, 2023

github-actions bot locked as resolved and limited conversation to collaborators Jun 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cloud: Allow Terraform Cloud/Enterprise to apply backpressure to intermediate state snapshots #32983

cloud: Allow Terraform Cloud/Enterprise to apply backpressure to intermediate state snapshots #32983

apparentlymart commented Apr 5, 2023 •

edited

Loading

apparentlymart commented Apr 25, 2023

apparentlymart commented Apr 25, 2023 •

edited

Loading

brandonc commented Apr 26, 2023

apparentlymart commented Apr 26, 2023 •

edited

Loading

brandonc commented Apr 26, 2023

apparentlymart commented Apr 26, 2023 •

edited

Loading

sebasslash commented Apr 26, 2023

apparentlymart commented Apr 26, 2023 •

edited

Loading

apparentlymart commented May 9, 2023

alisdair commented May 15, 2023

sebasslash left a comment

github-actions bot commented May 23, 2023

github-actions bot commented Jun 23, 2023

cloud: Allow Terraform Cloud/Enterprise to apply backpressure to intermediate state snapshots #32983

cloud: Allow Terraform Cloud/Enterprise to apply backpressure to intermediate state snapshots #32983

Conversation

apparentlymart commented Apr 5, 2023 • edited Loading

apparentlymart commented Apr 25, 2023

apparentlymart commented Apr 25, 2023 • edited Loading

brandonc commented Apr 26, 2023

apparentlymart commented Apr 26, 2023 • edited Loading

brandonc commented Apr 26, 2023

apparentlymart commented Apr 26, 2023 • edited Loading

sebasslash commented Apr 26, 2023

apparentlymart commented Apr 26, 2023 • edited Loading

apparentlymart commented May 9, 2023

alisdair commented May 15, 2023

sebasslash left a comment

Choose a reason for hiding this comment

github-actions bot commented May 23, 2023

github-actions bot commented Jun 23, 2023

apparentlymart commented Apr 5, 2023 •

edited

Loading

apparentlymart commented Apr 25, 2023 •

edited

Loading

apparentlymart commented Apr 26, 2023 •

edited

Loading

apparentlymart commented Apr 26, 2023 •

edited

Loading

apparentlymart commented Apr 26, 2023 •

edited

Loading