job modify index has changed since last refresh #398

shantanugadgil · 2023-12-04T10:14:28Z

Terraform Version

Terraform v1.6.5
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v5.29.0
+ provider registry.terraform.io/hashicorp/local v2.4.0
+ provider registry.terraform.io/hashicorp/nomad v2.0.0

Nomad Version

3 node server cluster at version 1.6.3

Provider Configuration

Which values are you setting in the provider configuration?

provider "nomad" {
  address = "http://nomad.somedomain:80"
  region  = "...."
}

Environment Variables

Do you have any Nomad specific environment variable set in the machine running Terraform?
Not Nomad specific, but we have TF_IN_AUTOMATION = "1" set as we run this under Atlantis.

Affected Resource(s)

nomad_job

We have a "common" job which runs using the nomad provider like so ...

Terraform Configuration Files

resource "nomad_job" "common" {
  hcl2 {
    allow_fs = true
  }
  purge_on_destroy = true

  jobspec = templatefile("${path.module}/common.nomad.tpl", {
    common_bash = data.local_file.common_bash.content
  })
}

We have Atlantis set up for automation.

The problem is that atlantis plan works fine, but atlantis apply fails.

Debug Output

N/A

Panic Output

N/A

Expected Behavior

apply should have worked properly

Actual Behavior

╷
│ Error: job modify index has changed since last refresh
│ 
│   with nomad_job.common,
│   on common.tf line 9, in resource "nomad_job" "common":
│    9: resource "nomad_job" "common" {
│ 
╵

Steps to Reproduce

Please list the steps required to reproduce the issue, for example:

terraform apply

Important Factoids

This behavior is only seen when run under Atlantis.
If I merge the code and run terraform apply on the command line, it doesn't occur.

References

N/A

Q: Is there something we could do (like a force refresh before apply) to make this error go away?


lgfa29 · 2023-12-16T02:28:40Z

Hi @shantanugadgil 👋

This error happens when the job is modified between the Terraform state refresh and the point where the provider plans the job submission.

terraform-provider-nomad/nomad/resource_job.go

Lines 745 to 749 in 8f20175

    
    if resp != nil && resp.JobModifyIndex != wantModifyIndex {
    	// Should rarely happen, but might happen if there was a concurrent
    	// other process writing to Nomad since our Read call.
    	return fmt.Errorf("job modify index has changed since last refresh")
    }

Unfortunately I have not been able to reproduce it since it should only happen in a race condition. Do you, by any chance, have multiple Terraform runs modifying the same job at the same time?
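
The race behind the check quoted above can be illustrated with a small, self-contained sketch. This is not the provider's code: `fakeNomad`, `write`, and `planWithCheck` are hypothetical names for a toy in-memory model that keeps only a per-job modify index and compares it with the value remembered from the last refresh.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrIndexChanged reuses the provider's error message for illustration.
var ErrIndexChanged = errors.New("job modify index has changed since last refresh")

// fakeNomad is a hypothetical in-memory stand-in for a Nomad cluster.
// It tracks nothing but a modify index per job ID.
type fakeNomad struct {
	jobModifyIndex map[string]uint64
}

// write simulates any successful job registration, which bumps the index.
func (n *fakeNomad) write(jobID string) uint64 {
	n.jobModifyIndex[jobID]++
	return n.jobModifyIndex[jobID]
}

// planWithCheck simulates the guard: the index remembered from the last
// refresh must still match what the cluster reports.
func (n *fakeNomad) planWithCheck(jobID string, wantModifyIndex uint64) error {
	if n.jobModifyIndex[jobID] != wantModifyIndex {
		return ErrIndexChanged
	}
	return nil
}

func main() {
	nomad := &fakeNomad{jobModifyIndex: map[string]uint64{}}
	refreshed := nomad.write("common") // index recorded at refresh time

	fmt.Println("no concurrent write:", nomad.planWithCheck("common", refreshed))

	nomad.write("common") // another process registers the job in between
	fmt.Println("after concurrent write:", nomad.planWithCheck("common", refreshed))
}
```

In this model the first check passes and the second fails, which is the situation the error message describes.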

Another unfortunate aspect of this is that the Terraform SDK doesn't provide a way for us to get around this. State refresh is handled completely outside of the provider, and by the time the provider is invoked, we only receive the state data. Retrying inside the provider would have no effect since the state data will always be the same.
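
A minimal sketch of why an in-provider retry would not help, assuming a toy model in which the provider is handed a fixed index from state. The `cluster` type and both functions are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// cluster is a hypothetical stand-in for Nomad; it only holds the job's
// current modify index.
type cluster struct{ modifyIndex uint64 }

// check reports whether the index from Terraform state still matches.
func (c *cluster) check(stateIndex uint64) bool { return c.modifyIndex == stateIndex }

// retryWithSameState retries the check without refreshing. Because the
// provider only receives the state data, every attempt uses the same index.
func retryWithSameState(c *cluster, stateIndex uint64, attempts int) bool {
	for i := 0; i < attempts; i++ {
		if c.check(stateIndex) {
			return true
		}
	}
	return false
}

// rerunWithRefresh models a whole new Terraform run: the refresh step reads
// the current index first, so the check can pass again.
func rerunWithRefresh(c *cluster) bool {
	stateIndex := c.modifyIndex // refresh happens outside the provider
	return c.check(stateIndex)
}

func main() {
	c := &cluster{modifyIndex: 11} // job was modified after state recorded 10
	fmt.Println("retry inside provider:", retryWithSameState(c, 10, 5))
	fmt.Println("new run with refresh:", rerunWithRefresh(c))
}
```

Under these assumptions only a new run, with its own refresh, sees the updated index, which is why the suggestions here focus on retrying whole runs rather than retrying inside the provider.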

I don't have a lot of experience with Atlantis, but is there a way to automatically retry failed runs? Or maybe reduce how many plan/applies are executed in parallel? The only way forward that I see to fix this would require changes outside of the provider.

Maybe we could make this check optional? But then a Terraform run that hits this race would be almost guaranteed to result in data loss, since the job was changed while the diff was computed.

shantanugadgil · 2023-12-18T09:32:59Z

> Unfortunately I have not been able to reproduce it since it should only happen in a race condition. Do you, by any chance, have multiple Terraform runs modifying the same job at the same time?

Nothing changes the job index outside of Terraform.

I could elaborate on the contents of the state... it is an AWS ASG with some Nomad jobs launching on it.

The job type is system, and it uses node.class to restrict the job to the ASG. This particular "common" job deploys some logrotate configs etc. and then sits in a sleep infinity.

> Another unfortunate aspect of this is that Terraform SDK doesn't provide a way for us to get around this. State refresh is handled completely outside of the provider and by the time it is invoked, we only receive the state data. Retrying inside the provider would have no effect since the state data will always be the same.

Due to this, we use the Atlantis method only to check that we have no "compile errors" (basic syntax issues) and the atlantis apply is a best effort. When the atlantis apply fails (which is random), we proceed to apply from a "terraform machine" as mentioned in the "factoids" above.

Since filing the bug report above, I have gone ahead and added -refresh=true to the apply step, but that hasn't helped :( It still fails randomly.

    apply:
      steps:
        - apply:
            extra_args: ["-refresh=true"]

As I am typing this, I realized another thing ... this job (and others like it) which randomly fail are in a different namespace than default.

(hunch) could that be related somehow?

lgfa29 · 2023-12-18T23:29:08Z

> As I am typing this, I realized another thing ... this job (and others like it) which randomly fail are in a different namespace than default.

I've recently merged this change, which forces a namespace on the job plan request.

The wrong namespace could be an issue if:

- You're running Terraform in a context where the NOMAD_NAMESPACE env var is set (like TFC or if your Atlantis runner is a Nomad allocation)
- You have jobs with the same name in different namespaces.

But I suspect you would have a lot more problems than the modify index being different 🤔
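
To make the two conditions above concrete, here is a toy sketch. The resolution order in `resolveNamespace` (explicit value, then `NOMAD_NAMESPACE`, then `default`) is an assumption made for illustration and is not taken from the provider source; the job names, namespaces, and index values are made up.

```go
package main

import (
	"fmt"
	"os"
)

// jobKey identifies a job; a job name is only unique within a namespace.
type jobKey struct{ namespace, name string }

// resolveNamespace is a hypothetical helper: it prefers an explicit value,
// then the value taken from the environment, then "default".
func resolveNamespace(explicit, env string) string {
	if explicit != "" {
		return explicit
	}
	if env != "" {
		return env
	}
	return "default"
}

func main() {
	// Made-up indexes for two jobs that share a name across namespaces.
	indexes := map[jobKey]uint64{
		{namespace: "default", name: "common"}: 42,
		{namespace: "ops", name: "common"}:     7,
	}
	// Index recorded in state for the job Terraform manages in "ops".
	stateIndex := indexes[jobKey{namespace: "ops", name: "common"}]

	// Plan request that does not force a namespace: the environment decides.
	ns := resolveNamespace("", os.Getenv("NOMAD_NAMESPACE"))
	planned := indexes[jobKey{namespace: ns, name: "common"}]
	fmt.Printf("namespace=%s index match=%v\n", ns, planned == stateIndex)

	// Plan request that forces the job's own namespace.
	forcedNS := resolveNamespace("ops", os.Getenv("NOMAD_NAMESPACE"))
	forced := indexes[jobKey{namespace: forcedNS, name: "common"}]
	fmt.Printf("forced namespace index match=%v\n", forced == stateIndex)
}
```

In this model, an unforced request compares against whichever same-named job the environment points at, while a forced request always compares against the job recorded in state.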

Next time it happens, could you collect the value of modify_index in state for the resource being modified and also the ModifyIndex from the Nomad API?
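
One possible way to compare the two values is sketched below, assuming the state is exported with `terraform show -json` and the job is read as JSON from the Nomad HTTP API. The sample payloads are made up, the structs model only the fields used here, and only root-module resources are searched. The job's `JobModifyIndex` is printed alongside `ModifyIndex` because the provider check quoted earlier compares `JobModifyIndex`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// tfState models only the parts of `terraform show -json` used here.
type tfState struct {
	Values struct {
		RootModule struct {
			Resources []struct {
				Address string                     `json:"address"`
				Values  map[string]json.RawMessage `json:"values"`
			} `json:"resources"`
		} `json:"root_module"`
	} `json:"values"`
}

// nomadJob models only the index fields of a Nomad job read response.
type nomadJob struct {
	ModifyIndex    uint64
	JobModifyIndex uint64
}

// stateModifyIndex returns modify_index for the resource at address.
// The attribute may be encoded as a string or a number, so quotes are trimmed.
func stateModifyIndex(showJSON []byte, address string) (uint64, error) {
	var st tfState
	if err := json.Unmarshal(showJSON, &st); err != nil {
		return 0, err
	}
	for _, r := range st.Values.RootModule.Resources {
		if r.Address == address {
			raw := strings.Trim(string(r.Values["modify_index"]), `"`)
			return strconv.ParseUint(raw, 10, 64)
		}
	}
	return 0, fmt.Errorf("resource %s not found", address)
}

func main() {
	// Made-up sample payloads standing in for real command and API output.
	show := []byte(`{"values":{"root_module":{"resources":[{"address":"nomad_job.common","values":{"modify_index":"1234"}}]}}}`)
	api := []byte(`{"ID":"common","ModifyIndex":1250,"JobModifyIndex":1240}`)

	inState, err := stateModifyIndex(show, "nomad_job.common")
	if err != nil {
		panic(err)
	}
	var job nomadJob
	if err := json.Unmarshal(api, &job); err != nil {
		panic(err)
	}
	fmt.Printf("state=%d api ModifyIndex=%d api JobModifyIndex=%d\n",
		inState, job.ModifyIndex, job.JobModifyIndex)
}
```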

shantanugadgil · 2023-12-19T16:15:36Z

> Next time it happens, could you collect the value of modify_index in state for the resource being modified and also the ModifyIndex from the Nomad API?

OK, will try to capture these the next time.

VenelinMartinov · 2024-10-02T11:21:00Z

Hey @lgfa29, I am one of the developers of the pulumi-nomad provider, which uses the TF nomad provider under the hood.

Pulumi, unlike Terraform, does not run refresh by default, so this issue affects Pulumi users more than users of the TF provider. This should be very reproducible in TF with terraform apply -refresh=false.

Is it possible to supply a flag here to disable the "index has changed since last refresh" check? For some workflows the check is not necessary, and clobbering the existing state is the desired behaviour.

lgfa29 added theme/resource/job type/bug stage/thinking labels Dec 16, 2023

lgfa29 self-assigned this Dec 16, 2023

lgfa29 added stage/needs-investigation and removed stage/thinking labels Dec 18, 2023

automagic mentioned this issue Sep 30, 2024

Job modify index changing on server side causing drift pulumi/pulumi-nomad#422

Open
