tfe_workspace_run resource's wait behavior is unreliable #898

nfagerlund · 2023-05-23T22:03:46Z

Terraform Cloud/Enterprise version

main

Terraform version

latest, doesn't matter.

Terraform Configuration Files

From the TestAccTFEWorkspaceRun_withApplyOnlyBlock test:

resource "tfe_workspace_run" "ws_run_parent" {
    workspace_id    = "<SOME_ID>"

    apply {
        manual_confirm = false
    }
}

resource "tfe_workspace_run" "ws_run_child" {
    workspace_id    = "<SOME ID>"
    depends_on   = [tfe_workspace_run.ws_run_parent]

    apply {
        manual_confirm = false
    }

    destroy {
        manual_confirm = false
    }
}

Test failure output

=== Failed
=== FAIL: tfe TestAccTFEWorkspaceRun_withApplyOnlyBlock (33.13s)
    resource_tfe_workspace_run_test.go:33: Step 1/1 error: Error running apply: exit status 1
        
        Error: run errored while applying run run-UbHRYDycYKSJT8tU (waited til status planned, currently status cost_estimating): transition not allowed
        
          with tfe_workspace_run.ws_run_parent,
          on terraform_plugin_test.tf line 3, in resource "tfe_workspace_run" "ws_run_parent":
           3: 	resource "tfe_workspace_run" "ws_run_parent" {
        

DONE 98 tests, 36 skipped, 1 failure in 328.060s
Error: Process completed with exit code 1.

Expected Behavior

The TestAccTFEWorkspaceRun_withApplyOnlyBlock test for the tfe_workspace_run resource should be reliably passing.

Actual Behavior

Instead, it fails randomly!

After adding some extra logging, it's looking like this indicates a legit logic error in the resource, which we didn't catch during review and testing. It only happens rarely!

In TFC's run state machine, certain states like :planned or :post_plan_completed can act as EITHER:
- A semi-final resting state awaiting confirmation.
- An incredibly brief transitory state en route to one of the other things that happens between a plan or an apply.
In the resource, we're only treating those states as the former.
If we happen to poll during the exact instant that a run passes through the transitory form of a maybe-resting state, we try to confirm and apply when it's not allowed.

So, we need to await a more reliable signal that the run is ready to rumble. Possibly r.Actions.IsConfirmable?

The text was updated successfully, but these errors were encountered:

nfagerlund · 2023-05-23T23:55:30Z

Ok, I did some research and here is my conclusion:

The current state of the TFC runs API basically sucks for "determining whether a run has reached a resting state yet or not." Pretty much every integration that does this ends up having to exploit an unseemly amount of internal knowledge about TFC's rails code.
This random race condition is a fundamental problem with the "two sets of statuses" design we inherited from multispace. So we can't fix it by only shuffling the sets around.
For a "quick" fix, I think we should refactor awaitRun to accept a done func(r *tfe.Run) (bool, error) argument, instead of two status map-sets. We can probably refactor the policy override, manual confirm, and apply waits into done-functions that just check the two sets (PROBABLY), but for the plan wait it'll look something more like this:
- if r.Actions.IsConfirmable, return (true, nil)
- else if the run's in one of the following definitely-done statuses, return (true, nil):
  - tfe.RunPlannedAndFinished
  - tfe.RunPolicyOverride
  - tfe.RunErrored
  - tfe.RunCanceled
- else if the run's in one of the states we sort of expect it might be in at this point, return (false, nil)
- else, we don't know what happened so return (false, fmt.Errorf("yo!!"))
For a longer term fix, maybe we should go mess with the run API and make it a lot easier to figure out what's going on with a run, without having to keep on top of the constantly-growing and strangely interacting lists of run statuses! We could add a pair of "is-finished" and "needs-attention" bools to the attributes (such that false + false means TFC's still working on something), and there's probably a couple good points in the run state machine to properly update them — I have my eye on things like confirm_or_finish! or handle_confirmable_transition or the after_transaction hooks for final statuses.

lucymhdavies · 2023-05-26T18:31:56Z

I've also noticed some issues which seems related.

Code in my case...

resource "tfe_workspace_run" "trigger_workspaces" {

...

  # Kick off an apply, but don't wait for it
  apply {
    wait_for_run   = false # Fire and forget
    manual_confirm = false # N/A if wait_for_run is false, but parameter is required
  }

First thing that is interesting here is that manual_confirm is still a required parameter, even if wait_for_run is false. This I'm assuming is just a limitation of how we write Terraform providers. I don't know if we can have some parameters only be conditionally mandatory. I'm not too bothered by that though tbh.

What actually is a problem however is that it looks like wait_for_run = false is being ignored here. What I was expecting here is a wait_for_run = false run to be fire-and-forget. i.e. start an Apply, and then don't care what happens next.

And when that happens, we get errors like this:

╷
│ Error: run errored while applying run run-PBHA5KP1vELeWepK (waited til status post_plan_running, currently status policy_checking): transition not allowed
│
│   with tfe_workspace_run.trigger_workspaces["webserver-azure-dev"],
│   on workspaces.tf line 207, in resource "tfe_workspace_run" "trigger_workspaces":
│  207: resource "tfe_workspace_run" "trigger_workspaces" {
│
╵

I can grab more detailed TF_LOG=debug logs later as needed. Like I say, the transition not allowed error is intermittent, so I guess it depends on what the API returns at the time, and this part is what you've already identified.

In cases where we do not get the error, the resource actually does confirm the apply itself.

To be fair, that IS expected behaviour for when manual_confirm = false, so I'm not worried about that. The problem is that wait_for_run isn't being respected.

(I can raise a separate issue for this if you'd prefer)

nfagerlund · 2023-05-26T19:06:23Z

Oh hmm!!! 🤔 I wonder if...

Ok, so it looks like we only check wait_for_run AFTER we await the plan and do the branch on manual_confirm. So really all wait_for_run = false actually means right now is "don't wait for the apply." I think we maybe didn't consider some combinations of these as closely as we should have, and we might want to add a way to set auto-approve for the run if you REALLY want to fire-and-forget...

@lucymhdavies I think this is a separate issue, if you're willing to file one! It might benefit from a grid of expected behavior for the combinations of attrs. :D

lucymhdavies · 2023-05-26T19:11:08Z

I was just gonna make a comment to that effect. I figured that out too :)

https://github.com/hashicorp/terraform-provider-tfe/blob/main/tfe/workspace_run_helpers.go#L146-L154

Yeah, I'll raise a separate issue. Please hold :)

lucymhdavies · 2023-07-04T15:18:19Z

With #921 merged and released, is this issue resolved, or do we have some additional work to do still?

(I thought for a second it was still broken, but then realised I was still using an old local build of the provider. With v0.46.0, while my testing has been far from extensive... seems not to be an issue anymore. Or, if it is, then it's at least far less frequent)

Uk1288 · 2023-07-05T13:48:24Z

Most of the expected run statuses and transitions have been added so this issue should happen very minimally whenever a missing status is encountered. These missing statuses will be added as discovered pending the more concrete run API solution.

nfagerlund added the bug label May 23, 2023

lucymhdavies mentioned this issue May 26, 2023

use tfe_workspace_run instead of multispace FancyCorp-Demo/tfc-org-setup#2

Closed

lucymhdavies mentioned this issue May 26, 2023

tfe_workspace_run resource checks wait_for_run too late #908

Closed

Uk1288 mentioned this issue Jun 6, 2023

Incorporate missing post plan terminal statuses to ensure plan is completely processed before attempting to confirm run #921

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tfe_workspace_run resource's wait behavior is unreliable #898

tfe_workspace_run resource's wait behavior is unreliable #898

nfagerlund commented May 23, 2023 •

edited

Loading

nfagerlund commented May 23, 2023

lucymhdavies commented May 26, 2023

nfagerlund commented May 26, 2023

lucymhdavies commented May 26, 2023

lucymhdavies commented Jul 4, 2023 •

edited

Loading

Uk1288 commented Jul 5, 2023

tfe_workspace_run resource's wait behavior is unreliable #898

tfe_workspace_run resource's wait behavior is unreliable #898

Comments

nfagerlund commented May 23, 2023 • edited Loading

Terraform Cloud/Enterprise version

Terraform version

Terraform Configuration Files

Test failure output

Expected Behavior

Actual Behavior

nfagerlund commented May 23, 2023

lucymhdavies commented May 26, 2023

nfagerlund commented May 26, 2023

lucymhdavies commented May 26, 2023

lucymhdavies commented Jul 4, 2023 • edited Loading

Uk1288 commented Jul 5, 2023

nfagerlund commented May 23, 2023 •

edited

Loading

lucymhdavies commented Jul 4, 2023 •

edited

Loading