Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tfe_workspace_run resource's wait behavior is unreliable #898

Open
nfagerlund opened this issue May 23, 2023 · 6 comments
Open

tfe_workspace_run resource's wait behavior is unreliable #898

nfagerlund opened this issue May 23, 2023 · 6 comments
Labels

Comments

@nfagerlund
Copy link
Member

nfagerlund commented May 23, 2023

Terraform Cloud/Enterprise version

main

Terraform version

latest, doesn't matter.

Terraform Configuration Files

From the TestAccTFEWorkspaceRun_withApplyOnlyBlock test:

resource "tfe_workspace_run" "ws_run_parent" {
    workspace_id    = "<SOME_ID>"

    apply {
        manual_confirm = false
    }
}

resource "tfe_workspace_run" "ws_run_child" {
    workspace_id    = "<SOME ID>"
    depends_on   = [tfe_workspace_run.ws_run_parent]

    apply {
        manual_confirm = false
    }

    destroy {
        manual_confirm = false
    }
}

Test failure output

=== Failed
=== FAIL: tfe TestAccTFEWorkspaceRun_withApplyOnlyBlock (33.13s)
    resource_tfe_workspace_run_test.go:33: Step 1/1 error: Error running apply: exit status 1
        
        Error: run errored while applying run run-UbHRYDycYKSJT8tU (waited til status planned, currently status cost_estimating): transition not allowed
        
          with tfe_workspace_run.ws_run_parent,
          on terraform_plugin_test.tf line 3, in resource "tfe_workspace_run" "ws_run_parent":
           3: 	resource "tfe_workspace_run" "ws_run_parent" {
        

DONE 98 tests, 36 skipped, 1 failure in 328.060s
Error: Process completed with exit code 1.

Expected Behavior

The TestAccTFEWorkspaceRun_withApplyOnlyBlock test for the tfe_workspace_run resource should be reliably passing.

Actual Behavior

Instead, it fails randomly!

After adding some extra logging, it's looking like this indicates a legit logic error in the resource, which we didn't catch during review and testing. It only happens rarely!

  • In TFC's run state machine, certain states like :planned or :post_plan_completed can act as EITHER:
    • A semi-final resting state awaiting confirmation.
    • An incredibly brief transitory state en route to one of the other things that happens between a plan or an apply.
  • In the resource, we're only treating those states as the former.
  • If we happen to poll during the exact instant that a run passes through the transitory form of a maybe-resting state, we try to confirm and apply when it's not allowed.

So, we need to await a more reliable signal that the run is ready to rumble. Possibly r.Actions.IsConfirmable?

@nfagerlund nfagerlund added the bug label May 23, 2023
@nfagerlund
Copy link
Member Author

Ok, I did some research and here is my conclusion:

  • The current state of the TFC runs API basically sucks for "determining whether a run has reached a resting state yet or not." Pretty much every integration that does this ends up having to exploit an unseemly amount of internal knowledge about TFC's rails code.
  • This random race condition is a fundamental problem with the "two sets of statuses" design we inherited from multispace. So we can't fix it by only shuffling the sets around.
  • For a "quick" fix, I think we should refactor awaitRun to accept a done func(r *tfe.Run) (bool, error) argument, instead of two status map-sets. We can probably refactor the policy override, manual confirm, and apply waits into done-functions that just check the two sets (PROBABLY), but for the plan wait it'll look something more like this:
    • if r.Actions.IsConfirmable, return (true, nil)
    • else if the run's in one of the following definitely-done statuses, return (true, nil):
      • tfe.RunPlannedAndFinished
      • tfe.RunPolicyOverride
      • tfe.RunErrored
      • tfe.RunCanceled
    • else if the run's in one of the states we sort of expect it might be in at this point, return (false, nil)
    • else, we don't know what happened so return (false, fmt.Errorf("yo!!"))
  • For a longer term fix, maybe we should go mess with the run API and make it a lot easier to figure out what's going on with a run, without having to keep on top of the constantly-growing and strangely interacting lists of run statuses! We could add a pair of "is-finished" and "needs-attention" bools to the attributes (such that false + false means TFC's still working on something), and there's probably a couple good points in the run state machine to properly update them — I have my eye on things like confirm_or_finish! or handle_confirmable_transition or the after_transaction hooks for final statuses.

@lucymhdavies
Copy link
Contributor

I've also noticed some issues which seems related.

Code in my case...

resource "tfe_workspace_run" "trigger_workspaces" {

...

  # Kick off an apply, but don't wait for it
  apply {
    wait_for_run   = false # Fire and forget
    manual_confirm = false # N/A if wait_for_run is false, but parameter is required
  }

First thing that is interesting here is that manual_confirm is still a required parameter, even if wait_for_run is false. This I'm assuming is just a limitation of how we write Terraform providers. I don't know if we can have some parameters only be conditionally mandatory. I'm not too bothered by that though tbh.

What actually is a problem however is that it looks like wait_for_run = false is being ignored here. What I was expecting here is a wait_for_run = false run to be fire-and-forget. i.e. start an Apply, and then don't care what happens next.

And when that happens, we get errors like this:

╷
│ Error: run errored while applying run run-PBHA5KP1vELeWepK (waited til status post_plan_running, currently status policy_checking): transition not allowed
│
│   with tfe_workspace_run.trigger_workspaces["webserver-azure-dev"],
│   on workspaces.tf line 207, in resource "tfe_workspace_run" "trigger_workspaces":
│  207: resource "tfe_workspace_run" "trigger_workspaces" {
│
╵

I can grab more detailed TF_LOG=debug logs later as needed. Like I say, the transition not allowed error is intermittent, so I guess it depends on what the API returns at the time, and this part is what you've already identified.

In cases where we do not get the error, the resource actually does confirm the apply itself.

image

To be fair, that IS expected behaviour for when manual_confirm = false, so I'm not worried about that. The problem is that wait_for_run isn't being respected.

(I can raise a separate issue for this if you'd prefer)

@nfagerlund
Copy link
Member Author

Oh hmm!!! 🤔 I wonder if...

Ok, so it looks like we only check wait_for_run AFTER we await the plan and do the branch on manual_confirm. So really all wait_for_run = false actually means right now is "don't wait for the apply." I think we maybe didn't consider some combinations of these as closely as we should have, and we might want to add a way to set auto-approve for the run if you REALLY want to fire-and-forget...

@lucymhdavies I think this is a separate issue, if you're willing to file one! It might benefit from a grid of expected behavior for the combinations of attrs. :D

@lucymhdavies
Copy link
Contributor

I was just gonna make a comment to that effect. I figured that out too :)

https://github.com/hashicorp/terraform-provider-tfe/blob/main/tfe/workspace_run_helpers.go#L146-L154

Yeah, I'll raise a separate issue. Please hold :)

@lucymhdavies
Copy link
Contributor

lucymhdavies commented Jul 4, 2023

With #921 merged and released, is this issue resolved, or do we have some additional work to do still?

(I thought for a second it was still broken, but then realised I was still using an old local build of the provider. With v0.46.0, while my testing has been far from extensive... seems not to be an issue anymore. Or, if it is, then it's at least far less frequent)

@Uk1288
Copy link
Contributor

Uk1288 commented Jul 5, 2023

Most of the expected run statuses and transitions have been added so this issue should happen very minimally whenever a missing status is encountered. These missing statuses will be added as discovered pending the more concrete run API solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants