Create New Resource Group: Status=404 Code="ResourceGroupNotFound" #18268

Closed
1 task done
pkirch opened this issue Sep 6, 2022 · 28 comments · Fixed by #25758

Comments

@pkirch

pkirch commented Sep 6, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

1.2.5

AzureRM Provider Version

3.15.1

Affected Resource(s)/Data Source(s)

azurerm_resource_group

Terraform Configuration Files

# Minimal config. Complete files linked here: https://github.com/agera-edc/MinimumViableDataspace/tree/ec1999cc7a8582407f7d089fb9396dde023e58bb/deployment/terraform/participant

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 3.1.0"
    }
  }

  backend "azurerm" {}
}

provider "azurerm" {
  features {
    key_vault {
      purge_soft_delete_on_destroy = true
    }
  }
}

resource "azurerm_resource_group" "participant" {
  name     = var.resource_group
  location = var.location
}

variable "resource_group" {
  default = "test-resource-group"
}

variable "location" {
  default = "northeurope"
}

Debug Output/Panic Output

Excerpt of error message. Full output: https://gist.github.com/pkirch/a674369d480389ce2ddd57f24499e5b2

azurerm_resource_group.participant: Creating...
╷
│ Error: retrieving Resource Group "rg-company1-mvd116": resources.GroupsClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceGroupNotFound" Message="Resource group 'rg-company1-mvd116' could not be found."
│ 
│   with azurerm_resource_group.participant,
│   on main.tf line 54, in resource "azurerm_resource_group" "participant":
│   54: resource "azurerm_resource_group" "participant" {
│ 
╵
Releasing state lock. This may take a few moments...
Error: Process completed with exit code 1.

Expected Behaviour

A new Azure resource group should be created reliably, without error, and should exist after creation.

Actual Behaviour

The deployment stopped with the error shown above.

The error occurred sporadically. Our logs show 12 failures in 662 runs.
Failures happened only within one time window, from 2022-07-27 02:57 p.m. to 2022-07-28 09:49 a.m.

I expect this issue is hard to troubleshoot from the data given. However, we hope filing this issue helps others in case it happens sporadically again.

As the failures happened a few weeks ago, the Terraform and AzureRM provider versions stated above are the ones that were in use when the errors occurred.

Steps to Reproduce

We have a GitHub Actions workflow that executes the following commands. (complete file)

      - name: 'Run terraform'
        id: runterraform
        run: |
          # Create backend.conf file to retrieve the remote terraform state during terraform init.
          echo '
            resource_group_name  = "${{ secrets.COMMON_RESOURCE_GROUP }}"
            storage_account_name = "${{ secrets.TERRAFORM_STATE_STORAGE_ACCOUNT }}"
            container_name       = "${{ secrets.TERRAFORM_STATE_CONTAINER }}"
            key                  = "${{ env.RESOURCES_PREFIX }}.tfstate"
          ' >> backend.conf
          terraform init -backend-config=backend.conf
          terraform apply -auto-approve

Important Factoids

No response

References

Issues that seem similar, but were closed and/or fixed a long time ago.

@pkirch pkirch added the bug label Sep 6, 2022
@github-actions github-actions bot removed the bug label Sep 6, 2022
@Amier3 Amier3 added the question label Sep 8, 2022
@jpmicrosoft
Contributor

@pkirch It seems like a dependency issue.

Add a depends_on that references the resource group to the resources that are deployed into it:
depends_on = [
  azurerm_resource_group.participant
]
Reference:
https://www.terraform.io/language/meta-arguments/depends_on

I hope this helps.

@DizzyDeveloper

DizzyDeveloper commented Apr 16, 2024

I am currently having the same issue with Azure resource groups.

I get maybe one or two of these messages in the Terraform output:
azurerm_resource_group.default_resource_group: Creating...

Before I get the same 404 error as above:

Error: retrieving Resource Group "zvt4xdts-rg": resources.GroupsClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceGroupNotFound" Message="Resource group 'zvt4xdts-rg' could not be found."

The annoying thing, though, is that if I check the subscription, the resource group does actually exist, so it does get created.

It also doesn't happen repeatably. We have a variety of Terraform projects, each creating their own resource group(s), and this failure seems to happen arbitrarily across projects.
In all cases the resource group does actually get created, but Terraform fails with the error above.

@ggomes-agc

I am also experiencing this issue fairly consistently with azurerm 3.99 (I also tried various versions from 3.70 onward).

A debug trace of the deployment shows that the resource is created. The final call to the API for validation responds with not found.
The provider then errors out and does not write to state.

A subsequent apply says that the resource exists and must be imported into state.

Not Working Deployment
2024-04-17T17:22:21.410Z [DEBUG] provider.terraform-provider-azurerm_v3.70.0_x5: AzureRM Request:
GET /subscriptions/3828d6d3-52e4-439a-8621-b528739f9fee/resourcegroups/GgPc-ggd-prod_paz_network-rg?api-version=2020-06-01 HTTP/1.1
HTTP/2.0 404 Not Found

2024-04-17T17:22:21.476Z [DEBUG] provider.terraform-provider-azurerm_v3.70.0_x5: AzureRM Request:
PUT /subscriptions/3828d6d3-52e4-439a-8621-b528739f9fee/resourcegroups/GgPc-ggd-prod_paz_network-rg?api-version=2020-06-01 HTTP/1.1
HTTP/2.0 201 Created

2024-04-17T17:22:21.615Z [DEBUG] provider.terraform-provider-azurerm_v3.70.0_x5: AzureRM Request:
GET /subscriptions/3828d6d3-52e4-439a-8621-b528739f9fee/resourcegroups/GgPc-ggd-prod_paz_network-rg?api-version=2020-06-01 HTTP/1.1
HTTP/2.0 404 Not Found
{"error":{"code":"ResourceGroupNotFound","message":"Resource group 'GgPc-ggd-prod_paz_network-rg' could not be found."}}: timestamp=2024-04-17T17:22:21.678Z

Our code deploys 12 resource groups via module calls. One or two of the RGs fail on nearly every apply, but which RGs fail varies. We have about a 10% success rate on apply.

This code worked consistently a month ago.

@KealeyGR

I'm encountering a similar issue with azurerm version 3.95. Like you, I've tried various versions from 3.70 onward with no success. During deployment, the resource is created but the final API call for validation consistently returns 'not found,' resulting in errors and failure to write to state.

We're deploying multiple resource groups via module calls, and about 10% of the time, a few of these RGs consistently fail. Strangely, the failing RGs vary with each deployment. This behavior is inconsistent with what we experienced a month ago when the code worked reliably.

I'd appreciate any insights or suggestions on how to resolve this issue. Thanks!

@dantape

dantape commented Apr 18, 2024

I am using azurerm 3.83.0 and we started getting this a couple of days ago. Our build servers run Ubuntu, and our builds that create resource groups are not having much success. Sometimes the resource group is created and sometimes it isn't, but we are pretty consistently getting the NotFound error either way. When running locally on Windows, my latest configuration worked fine the first time. I am not sure yet whether there is any correlation between OSs; there is not enough data. I am thinking about opening a ticket with Microsoft: with so many provider versions affected, did one of their APIs change?

@dingliu
Contributor

dingliu commented Apr 18, 2024

We’re having this issue as well. When creating multiple resource groups in parallel, the creation of one resource group or another fails intermittently. The error message is a 404: the resource group could not be found.

We have tried provider versions 3.54.0 to 3.99.0 and the issue persists.

The same code was working about two weeks ago. We started experiencing this issue within the last two weeks.

@eddieb96

We are also seeing it while creating multiple resource groups in parallel. We tested provider version 3.100 and the issue persists.

@dantape

dantape commented Apr 19, 2024

I enabled trace logging and was seeing some weird behavior. I was on terraform version 1.3.6. After updating to 1.8.1 I have not seen this issue again.

@haodeon

haodeon commented Apr 21, 2024

I've been getting this error for over a week now.

I tried upgrading Terraform to 1.8.1.

I'm still having issues with resource group creation, getting "Provider produced inconsistent result after apply" and "Root object was present, but now absent".

The resource group still gets created but doesn't get saved into state. To me it seems like the same issue with a different error message.

@DizzyDeveloper

DizzyDeveloper commented Apr 21, 2024

Hi @katbyte,
I am pinging you to see if we can get some input from one of the HashiCorp devs on this issue, as we haven't heard from you on it yet. Just checking that you are aware it has picked up quite a bit over the last week.

@chalecado

Happening for us too. We have had to disable Azure in our dev environment due to this issue.

@alexpilon666

Been having the exact same problem here inconsistently for the past couple of weeks, always using the latest AzureRM provider version available at the time.

We have a suite of tests that runs in CI in Azure DevOps and executes a set of tests (terraform init/plan/apply/destroy) when a PR is created that modifies one of our modules. This sometimes means that 30+ tests are queued in our ADO pipeline. I just launched a test suite, and with 8 parallel jobs running (1 job == terraform init/plan/apply/destroy performed on one test suite), 5 failed immediately after trying to create the RG, with the exact same issue. I now have 8 currently running tests that managed to create the resource group just fine and proceeded with the rest.

@MikeSchiessl

This also sometimes happened to me while running identical TF code over and over.
There seems to be something wrong on the Azure side: we get a 404 from Azure even though the resource group has been successfully created, as if Azure cannot find it.

@favoretti
Collaborator

favoretti commented Apr 25, 2024

Here's an update I got from MSFT support.

I am an Azure support engineer on the ARM team, and I will be working with you on this case. I understand that you are seeing intermittent 404 errors over the last two weeks.

Our product teams are aware of this problem and working on this issue. There is a lot of work spanning multiple services working on this behavior. I will keep you up to date with their progress.

The issue is related to replication of ARM data among regions. For example, another customer has some requests going to East US and other requests to East US 2, and during the time it takes to replicate between the two, they get 404s. The database account is a multi-master account with session consistency, so write operations are replicated across regions asynchronously. Session consistency only provides read-your-writes guarantees within the scope of a session, which is defined either by the application (ARM) or by the SDK (in which case the session spans only a single CosmosClient instance). Given that several of the reads returning 404 after the creation of the resource group were done not only from a different ARM FD machine but even from a different region, they were made outside of the session scope and were therefore effectively eventually consistent. The ARM team has worked in the past to make the multi-master model work transparently, and I assume they will continue this work, as will our other teams working on the problem.

I will keep you up to date. Thanks for reporting the issue.
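
The difference between session and eventual consistency described in the support reply can be illustrated with a toy model. The sketch below is not Cosmos DB or ARM code: the `Store` and `Replica` classes, the integer session token, and the region names are all hypothetical, chosen only to show why a read that carries no session token may return 404 right after a successful write in another region.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One regional copy of the data; `applied` counts the writes it has seen."""
    name: str
    applied: int = 0
    data: dict = field(default_factory=dict)


class Store:
    """Toy multi-region store with asynchronous replication (illustrative only)."""

    def __init__(self, names):
        self.replicas = {n: Replica(n) for n in names}
        self.log = []  # ordered list of (key, value) writes

    def write(self, region, key, value):
        """Apply a write in one region only and return a session token."""
        self.log.append((key, value))
        token = len(self.log)
        self._catch_up(self.replicas[region], token)
        return token

    def replicate(self):
        """Simulate background replication reaching every region."""
        for replica in self.replicas.values():
            self._catch_up(replica, len(self.log))

    def read(self, region, key, session_token=None):
        """Return 200 or 404; with a token, never serve data older than it."""
        replica = self.replicas[region]
        if session_token is not None and replica.applied < session_token:
            # Read-your-writes: the replica must first reach the caller's write.
            self._catch_up(replica, session_token)
        return 200 if key in replica.data else 404

    def _catch_up(self, replica, upto):
        for key, value in self.log[replica.applied:upto]:
            replica.data[key] = value
        replica.applied = max(replica.applied, upto)


if __name__ == "__main__":
    store = Store(["eastus", "eastus2"])
    store.write("eastus", "rg-demo", {"location": "northeurope"})
    print(store.read("eastus2", "rg-demo"))  # no token, replica lagging: 404
    store.replicate()
    print(store.read("eastus2", "rg-demo"))  # after replication: 200
```

In this model a reader that passes the token returned by `write` gets 200 from any region, while a reader without it sees 404 until `replicate()` has run. That mirrors the situation in the reply, where the follow-up read falls outside the session scope.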

@Zezo0001

Zezo0001 commented Apr 25, 2024

@favoretti That response is inaccurate because in our case we're working with ONLY one region and noticing the error

@favoretti
Collaborator

favoretti commented Apr 25, 2024

That response is inaccurate because in our case we're working with ONLY one region and noticing the error

The ARM load balancer will send you places. It's not related to the region where you are creating the resources.

@alexpilon666

@zoelfakar1 same here. Our use case for deploying tests/examples deploys everything in a single region, in a single subscription. The very first and essentially only pre-step of deploying our tests/examples generates a 4-character random string in Terraform and then creates the resource group. Once the resource group is created using a terraform apply -target module.helper (we created a helper module to manage this instead of repeating the code in all of our tests/examples), then we run a separate terraform apply to actually create everything in our test/example.

So in our case, we're not even trying to manage/create anything other than the resource group, and it still fails at least 20% of the time, whether we're running one or more tests/examples at a time in our CI pipeline.

@ggomes-agc

Same for us. We are only deploying to Canada Central.

@Zezo0001

We have also noticed a delay in the resource group creation in the Azure portal (i.e. it gets created after the terraform apply finishes/errors out). That would explain why the final API call for validation returns "Resource group 'XXX-XXX' could not be found." (as it did not exist yet).

@srjennings

+1, same issue.

@favoretti
Collaborator

I'm working on a "fix" for this. So far it seems to work, I'm going to run an overnight test for this, after which we can discuss merging it upstream.

@favoretti
Collaborator

To address the comments that refer to "deployments to a single region": the ARM API itself is multi-region. Each request that the provider sends to the API can potentially create a new HTTP session, which means session consistency on the ARM backend won't help. The CreateOrUpdate() method in the resource sends a PUT request to the ARM API, which returns a success. To populate the resource data, CreateOrUpdate() then calls the resourceResourceGroupRead() method, which in turn calls the Get() client method. Because that call might end up in another session, it's not guaranteed that your request will land in the same region of the ARM API (this is not related to the region you're creating your resources in; management.azure.com is globally load balanced). If the eventually consistent database that backs Azure resources has not finished replicating the fact that your RG was created, it will rightfully respond with a 404: according to that ARM API instance, the resource group doesn't exist.

As a consequence, the provider will error out. A subsequent TF run will give you an error that the resource group already exists and requires import, most likely because it takes just a couple more seconds for the data to be reconciled across the Azure backend databases.

The kludge I added will just retry Get() on the resource until it consistently finds it 5 times in a row, after which we can be fairly certain it's all good.

Hope this helps clarify the issue and attempted workaround.
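
For readers who want to apply the same idea in their own tooling, the "retry until found 5 times in a row" approach might look roughly like the Python sketch below. This is not the provider's code (the provider is written in Go); the function name `wait_until_consistent`, the timeout and interval defaults, and the choice to treat any status other than 200 or 404 as a hard error are all assumptions made for illustration. Only the count of five consecutive successes comes from the comment above.

```python
import time
from typing import Callable


def wait_until_consistent(
    get_status: Callable[[], int],
    required_successes: int = 5,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until `get_status` returns 200 `required_successes` times in a row.

    A 404 resets the streak, since it suggests some replica has not caught up.
    Any other status is treated as a real error and raised immediately.
    Returns False if the deadline passes before the streak is reached.
    """
    deadline = clock() + timeout_s
    streak = 0
    while clock() < deadline:
        status = get_status()
        if status == 200:
            streak += 1
            if streak >= required_successes:
                return True
        elif status == 404:
            streak = 0  # a lagging replica answered; start counting again
        else:
            raise RuntimeError(f"unexpected status {status}")
        sleep(interval_s)
    return False
```

`get_status` is a caller-supplied function that performs one GET on the resource and returns the HTTP status code, so the loop can be exercised with a fake before it is wired to a real client. Requiring several consecutive successes, rather than one, reduces the chance that a single lucky read from an up-to-date replica is mistaken for full replication.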

@haodeon

haodeon commented Apr 26, 2024

I suspect there has been a change to the ARM service. From my own experience and that of others who have commented on this issue, the recent problem started appearing on the 12th of April.

I agree it’s probably an eventual consistency problem. I wrote a Python script which loops through creating, reading and deleting resource groups, printing the response headers. Whenever a 404 is returned for Get resource group, the x-ms-routing-request-id shows that the request was served from a different region to the one the resource group is created in.
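
The script itself was not posted, so the following is only a sketch of the header comparison it describes, not a reconstruction of it. It assumes the `x-ms-routing-request-id` value has the form `REGION:TIMESTAMP:GUID`; that format, the function names, and the sample header values in the usage note are assumptions. The code makes no network calls: it only inspects status codes and headers that have already been recorded.

```python
from typing import Iterable, List, Mapping, Optional, Tuple


def routing_region(headers: Mapping[str, str]) -> Optional[str]:
    """Return the lower-cased region prefix of x-ms-routing-request-id.

    Assumes the header value looks like 'REGION:TIMESTAMP:GUID'.
    Returns None when the header is missing or has another shape.
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    value = lowered.get("x-ms-routing-request-id")
    if not value or ":" not in value:
        return None
    return value.split(":", 1)[0].lower()


def cross_region_404s(
    expected_region: str,
    reads: Iterable[Tuple[int, Mapping[str, str]]],
) -> List[str]:
    """List the serving regions of 404 reads that differ from `expected_region`.

    `reads` holds (status_code, response_headers) pairs recorded after the create.
    """
    expected = expected_region.lower()
    suspicious = []
    for status, headers in reads:
        region = routing_region(headers)
        if status == 404 and region is not None and region != expected:
            suspicious.append(region)
    return suspicious
```

For example, given recorded reads of `(200, {"x-ms-routing-request-id": "NORTHEUROPE:..."})` and `(404, {"x-ms-routing-request-id": "WESTEUROPE:..."})`, calling `cross_region_404s("northeurope", reads)` reports `["westeurope"]`. If every 404 in a long run shows up in this list, that would be consistent with the replication explanation given earlier in the thread.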

Upon discovering this I opened a support ticket with MS. This might be an intended change that the azurerm provider will need to adapt to. The provider code has not been touched for a long time and the API version is fixed to a deprecated version of the Go SDK.

@haodeon

haodeon commented Apr 28, 2024

Azure support got back to me and said the product group made a fix over the weekend.

@favoretti
Collaborator

Azure support got back to me and said the product group made a fix over the weekend.

They might have; however, this is unfortunately not the first time this issue has resurfaced. Also, my contacts have reported nothing about a fix yet :)

@eddieb96

Started seeing this again yesterday on the 3.100.0 Azure Terraform provider.

@favoretti
Collaborator

Started seeing this again yesterday on the 3.100.0 Azure Terraform provider.

Try 3.102.0 please - that's where my workaround got merged. Would be interested in hearing if it helps.


I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 14, 2024