cloud: Allow Terraform Cloud/Enterprise to apply backpressure to intermediate state snapshots #32983
The two additional commits just added change the rule for the Terraform Cloud integration so that it will still persist state snapshots like any other state storage by default, but will allow the server to apply dynamic backpressure by setting a special HTTP header in its response to Create New State Version. If the server requests an interval shorter than the default one (currently 20 seconds) then the default one takes priority. If the server requests a longer interval then that takes priority, saturating at 1 hour, because a delay longer than that isn't very reasonable and probably suggests a bug.

This doesn't change the rules for the "remote" backend, because using that backend for Terraform Cloud is now deprecated, and for newer versions of Terraform CLI the remote backend would never be used in a remote run in Terraform Cloud anyway. Older versions of Terraform CLI that did require the remote backend for Terraform Cloud did not yet have the possibility of persisting intermediate snapshots.
For the moment this depends on an unreleased commit from go-tfe. I'll leave this in draft status until we have a real go-tfe release to depend on, but otherwise I'd love for folks to take a look at this and flag any concerns so that I can respond to them concurrently with the review and release process on go-tfe.

I intend this to be complementary to, rather than instead of, other efforts to improve Terraform Cloud's handling of intermediate state snapshots. This change is focused on giving Terraform Cloud/Enterprise a way to protect themselves from a barrage of incoming requests, but doesn't do anything to address other concerns, such as the unfortunate way Terraform Cloud presents multiple state versions per run in its UI today.
@apparentlymart One thing that jumps out at me is that the platform assumes that the last created state version is the current workspace state version (and each gets parsed for outputs and workspace resources, which become workspace artifacts). It feels like we need the Persister API to give us a hint about whether the state is finalized or still pending.
Hi @brandonc, I'm not sure I see a situation where the server-side state storage should try to distinguish between an intermediate state snapshot and a final one. What makes a final snapshot final is only that it was the most recent one. If Terraform CLI writes an intermediate snapshot and then crashes, that intermediate snapshot is the final snapshot for that run, and so I don't think it should be treated any differently than the final snapshot created after a run successfully completes. In either case the latest snapshot is the most accurate description we have of the infrastructure that exists, and so should always replace any snapshots created by earlier runs. Does that make sense to you too, or am I missing something?
@apparentlymart It does make sense, yes. My concern, though, is that we had aimed at not engaging certain Cloud services (state parsing mainly, which produces resource and output artifacts) until the last snapshot in the operation was completed. I suppose we could engage that stuff after the run completes.
Hmm, interesting! I suppose I had assumed that the completion of the run in Terraform Cloud would be the trigger for that sort of thing. By which I mean:
It seems to me that we need to make sure that we always eventually do the async analysis, even if Terraform CLI crashes in the middle of a run, because otherwise that would leave the Terraform Cloud metadata out of sync with the real latest state snapshot, right? So what I described above is just an optimization to allow us to delay doing it when we're in control of the execution environment, because in that case we can be guaranteed that even if Terraform CLI or the Terraform Cloud Agent crashes the platform will still have an opportunity to run some final steps once the apply step is complete, and therefore we don't need to be so conservative in processing every intermediate snapshot just in case the CLI or agent crashes. |
Correct, the tricky issue here is workspaces in local execution mode. As you stated, TFC has no "framing context" in this case, so I was trying to figure out a way we could minimize the number of state files we'd need to store and process. One potential way to do it is to only process the latest state version (one created outside of a TFC run) when a workspace is unlocked, although I feel like this might be "too late" and be a detriment to the local execution experience. It would've really simplified things if the StateManager had context about the status of the current apply operation that was modifying state; that way we'd have a source of truth in order to tell TFC "hey, we aren't going to create any more state versions".
I agree that it would be better to avoid processing the intermediate snapshots when possible, but from what I can see it doesn't seem to be possible in all cases: if Terraform CLI crashes before it produces the final snapshot then we need to somehow retroactively decide that the most recent intermediate snapshot was the final one, because otherwise the system will become inconsistent.

I hadn't previously considered using the locking mechanism for the "run framing", but that does seem like an alternative worth investigating: even in local mode Terraform CLI will hold the workspace lock throughout the apply phase, so perhaps the rule would be that any new snapshots created while the lock is held don't trigger any analysis, but then once the lock is released we immediately trigger the analysis of whatever is the latest snapshot at that point, if the latest snapshot wasn't already analyzed. Unless I'm missing something, that would achieve the needed behavior for both local and remote runs, because both of them involve the workspace lock.

(This does mean that if Terraform CLI crashes while holding the lock for a local operation then the workspace will stay locked and the analysis won't immediately run, but presumably it won't be too long before someone manually unlocks the workspace to continue working, and the analysis would begin at that point.)
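As a rough sketch of that lock-based rule, hypothetical server-side logic might look like the following. All names here are invented for illustration; nothing in this sketch comes from the actual Terraform Cloud codebase:

```go
package main

import "fmt"

// Workspace models the hypothetical rule sketched above: snapshots created
// while the workspace lock is held are stored but not analyzed, and
// releasing the lock triggers analysis of the latest snapshot if it hasn't
// been analyzed yet.
type Workspace struct {
	locked         bool
	latestSnapshot int // id of the most recent snapshot, 0 if none
	analyzed       int // id of the last analyzed snapshot, 0 if none
}

func (w *Workspace) Lock() { w.locked = true }

// CreateSnapshot records a new state version. Outside a lock (i.e. outside
// a run) it is analyzed immediately; during a run, analysis is deferred.
func (w *Workspace) CreateSnapshot(id int) {
	w.latestSnapshot = id
	if !w.locked {
		w.analyze(id)
	}
}

// Unlock releases the lock and catches up on any deferred analysis, so the
// system converges even if intermediate snapshots were skipped.
func (w *Workspace) Unlock() {
	w.locked = false
	if w.latestSnapshot != 0 && w.analyzed != w.latestSnapshot {
		w.analyze(w.latestSnapshot)
	}
}

func (w *Workspace) analyze(id int) {
	w.analyzed = id
	fmt.Printf("analyzed snapshot %d\n", id)
}

func main() {
	ws := &Workspace{}
	ws.Lock()
	ws.CreateSnapshot(1) // intermediate: analysis deferred
	ws.CreateSnapshot(2) // intermediate: analysis deferred
	ws.Unlock()          // analyzes only the latest snapshot
}
```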
I've rebased this to include the merged version of the go-tfe change, but that isn't included in a go-tfe release yet so I cannot finalize this PR to refer to a released version. |
Previously we just always used the same intermediate state persistence behavior for all state storages. However, some storages might have access to additional information that allows them to tailor when they persist, such as reacting to API rate limit status headers in responses, or just knowing that a particular storage isn't suited to intermediate snapshots at all for some reason. This commit doesn't actually change any observable behavior yet, but it introduces an optional means for a state storage to customize the behavior which we may make use of in certain storage implementations in future commits.
We've seen some concern about the additional storage usage implied by creating intermediate state snapshots for particularly long apply phases that can arise when managing a large number of resource instances together in a single workspace. This is an initial coarse approach to solving that concern, just restoring the original behavior when running inside Terraform Cloud or Enterprise for now and not creating snapshots at all. This is here as a solution of last resort in case we cannot find a better compromise before the v1.5.0 final release. Hopefully a future commit will implement a more subtle take on this which still gets some of the benefits when running in a Terraform Enterprise environment but in a way that will hopefully be less concerning for Terraform Enterprise administrators. This does not affect any other state storage implementation except the Terraform Cloud integration and the "remote" backend's state storage when running inside a TFC/TFE-driven remote execution environment.
Previously we just made a hard rule that the state storage for Terraform Cloud would never save any intermediate snapshots at all, as a coarse way to mitigate concerns over heightened Terraform Enterprise storage caused by saving intermediate snapshots. As a better compromise, we'll now create intermediate snapshots at the default interval unless the Terraform Cloud API responds with a special extra header field X-Terraform-Snapshot-Interval, which specifies a different number of seconds (up to 1 hour) to wait before saving the next snapshot. This will then allow Terraform Cloud and Enterprise to provide some dynamic backpressure when needed, either to reduce the disk usage in Terraform Enterprise or in situations where Terraform Cloud is under unusual load and needs to calm the periodic intermediate snapshot writes from clients. This respects the "force persist" mode so that if Terraform CLI is interrupted with SIGINT then it'll still be able to urgently persist a snapshot of whatever state it currently has, in anticipation of probably being terminated with a more aggressive signal very soon.
🚀 I'll go ahead and add further refinements on toggling the setting in a new PR
NOTE: The scope of this PR changed during discussion; see the commentary below to learn about the expanded scope.
The remaining text is the original PR writeup from before the scope increase.
We've seen some concern about the additional storage usage in Terraform Enterprise implied by creating intermediate state snapshots for particularly long apply phases that can arise when managing a large number of resource instances together in a single workspace.
This is an initial coarse approach to solving that concern, just restoring the original behavior when running inside Terraform Cloud or Enterprise for now and not creating intermediate snapshots at all.
This is here as a solution of last resort in case we do not find a better compromise before the v1.5.0 final release. Hopefully a future changeset will implement a more subtle take on this which still gets some of the benefits when running in a Terraform Enterprise environment but in a way that will hopefully be less concerning for Terraform Enterprise administrators, in which case the effects of this PR will not make it into any stable release.
This does not affect any other state storage implementation except the Terraform Cloud integration and the "remote" backend's state storage when running inside a TFC/TFE-driven remote execution environment.