google provider: Terraform state leaks on interrupt #14782

Open
ialidzhikov opened this issue Jun 1, 2023 · 7 comments
Labels
persistent-bug, service/terraform, size/xl


ialidzhikov commented Jun 1, 2023

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

% ./terraform --version
Terraform v1.3.9
on darwin_amd64
+ provider registry.terraform.io/hashicorp/google v4.53.1
+ provider registry.terraform.io/hashicorp/google-beta v4.53.1
+ provider registry.terraform.io/hashicorp/null v3.2.1

Affected Resource(s)

  • google_compute_network

Terraform Configuration Files

terraform {
  required_providers {
    google = {
      version = "4.53.1"
    }
    google-beta = {
      version = "4.53.1"
    }
    null = {
      version = "3.2.1"
    }
  }
}

provider "google" {
  credentials = "${file("serviceaccount.json")}"
  project     = "project-foo"
  region      = "europe-west1"
}

//=====================================================================
//= Service Account
//=====================================================================

resource "google_service_account" "serviceaccount" {
  account_id   = "shoot--local--foo"
  display_name = "shoot--local--foo"
}

//=====================================================================
//= Networks
//=====================================================================

resource "google_compute_network" "network" {
  name                    = "shoot--local--foo"
  auto_create_subnetworks = "false"

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

resource "google_compute_subnetwork" "subnetwork-nodes" {
  name          = "shoot--local--foo-nodes"
  ip_cidr_range = "10.250.0.0/24"
  network       = google_compute_network.network.name
  region        = "europe-west1"

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

resource "google_compute_router" "router" {
  name    = "shoot--local--foo-cloud-router"
  region  = "europe-west1"
  network = google_compute_network.network.name

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

resource "google_compute_router_nat" "nat" {
  name                               = "shoot--local--foo-cloud-nat"
  router = google_compute_router.router.name
  region                             = "europe-west1"
  nat_ip_allocate_option             = "AUTO_ONLY"
  

  source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"
  subnetwork {
    name                    = google_compute_subnetwork.subnetwork-nodes.self_link
    source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
  }
  min_ports_per_vm                    = "2048"
  enable_endpoint_independent_mapping = false

  log_config {
    enable = true
    filter = "ERRORS_ONLY"
  }

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

//=====================================================================
//= Firewall
//=====================================================================

// Allow traffic within internal network range.
resource "google_compute_firewall" "rule-allow-internal-access" {
  name          = "shoot--local--foo-allow-internal-access"
  network       = google_compute_network.network.name
  source_ranges = ["10.250.0.0/24", "100.96.0.0/16"]
  allow {
    protocol = "icmp"
  }

  allow {
    protocol = "ipip"
  }

  allow {
    protocol = "tcp"
    ports    = ["1-65535"]
  }

  allow {
    protocol = "udp"
    ports    = ["1-65535"]
  }

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

resource "google_compute_firewall" "rule-allow-external-access" {
  name          = "shoot--local--foo-allow-external-access"
  network       = google_compute_network.network.name
  source_ranges = ["0.0.0.0/0"]

  allow {
    protocol = "tcp"
    ports    = ["443"] // Allow ingress
  }

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

// Required to allow Google to perform health checks on our instances.
// https://cloud.google.com/compute/docs/load-balancing/internal/
// https://cloud.google.com/compute/docs/load-balancing/network/
resource "google_compute_firewall" "rule-allow-health-checks" {
  name          = "shoot--local--foo-allow-health-checks"
  network       = google_compute_network.network.name
  source_ranges = [
    "35.191.0.0/16",
    "209.85.204.0/22",
    "209.85.152.0/22",
    "130.211.0.0/22",
  ]

  allow {
    protocol = "tcp"
    ports    = ["30000-32767"]
  }

  allow {
    protocol = "udp"
    ports    = ["30000-32767"]
  }

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}

// We have introduced new output variables. However, they are not applied for
// existing clusters as Terraform won't detect a diff when we run `terraform plan`.
// Workaround: Providing a null-resource for letting Terraform think that there are
// differences, enabling the Gardener to start an actual `terraform apply` job.
resource "null_resource" "outputs" {
  triggers = {
    recompute = "outputs"
  }
}

//=====================================================================
//= Output variables
//=====================================================================

output "vpc_name" {
  value = google_compute_network.network.name
}

output "cloud_router" {
  value = google_compute_router.router.name
  
}

output "cloud_nat" {
  value = google_compute_router_nat.nat.name
}

output "service_account_email" {
  value = google_service_account.serviceaccount.email
}

output "subnet_nodes" {
  value = google_compute_subnetwork.subnetwork-nodes.name
}

Debug Output

N/A

Panic Output

N/A

Expected Behavior

terraform/terraform-provider-gcp should be resilient to interrupts and should not leak Terraform state when an interrupt is received. We run Terraform in a highly automated manner without human interaction. Every time state leaks, a human operator has to analyse and fix it manually.

Actual Behavior

terraform/terraform-provider-gcp leaks state when the first terraform apply (the one that creates the resources) is interrupted.

Steps to Reproduce

  1. terraform init

  2. terraform apply -auto-approve

After the apply starts, interrupt it within 3-5 seconds. Logs:

% terraform apply -auto-approve

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

<omitted>

null_resource.outputs: Creating...
null_resource.outputs: Creation complete after 0s [id=6097200675366732611]
google_service_account.serviceaccount: Creating...
google_compute_network.network: Creating...
google_service_account.serviceaccount: Creation complete after 2s [id=projects/<omitted>/serviceAccounts/shoot--local--foo@<omitted>.iam.gserviceaccount.com]
^C
Interrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...

Stopping operation...
google_compute_network.network: Still creating... [10s elapsed]
╷
│ Error: execution halted
│
│
╵
╷
│ Error: execution halted
│
│
╵
╷
│ Error: Error waiting to create Network: Error waiting for Creating Network: error while retrieving operation: unable to finish polling, context has been cancelled
│
│   with google_compute_network.network,
│   on main.tf line 34, in resource "google_compute_network" "network":
│   34: resource "google_compute_network" "network" {
│
╵

Note that the network creation fails with Error: Error waiting to create Network: Error waiting for Creating Network: error while retrieving operation: unable to finish polling, context has been cancelled. The network is created in GCP but not saved in the Terraform state.
A subsequent terraform apply -auto-approve then fails because the network already exists:

% terraform apply -auto-approve
null_resource.outputs: Refreshing state... [id=6097200675366732611]
google_service_account.serviceaccount: Refreshing state... [id=projects/<omitted>/serviceAccounts/shoot--local--foo@<omitted>.iam.gserviceaccount.com]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

<omitted>

Plan: 7 to add, 0 to change, 0 to destroy.
google_compute_network.network: Creating...
╷
│ Error: Error creating Network: googleapi: Error 409: The resource 'projects/<omitted>/global/networks/shoot--local--foo' already exists, alreadyExists
│
│   with google_compute_network.network,
│   on main.tf line 34, in resource "google_compute_network" "network":
│   34: resource "google_compute_network" "network" {
│
╵

The issue is always reproducible.

Important Factoids

References

@ialidzhikov ialidzhikov added the bug label Jun 1, 2023
@ialidzhikov ialidzhikov changed the title Terraform state leaks on interrupt google_compute_network: Terraform state leaks on interrupt Jun 1, 2023
hao-nan-li (Collaborator) commented:

For the network already exists error, it looks like that could be cleaned up in Pantheon.

For the first error, have you tried increasing the timeouts? I've run similar tests in the past, and it could take more than 5 minutes to create/update.

@ialidzhikov ialidzhikov changed the title google_compute_network: Terraform state leaks on interrupt r/google_compute_network: Terraform state leaks on interrupt Jun 2, 2023
ialidzhikov (Author) commented:

For the first error, have you tried increasing the timeouts? I've run similar tests in the past, and it could take more than 5 minutes to create/update.

I don't think this is related to google_compute_network timeouts at all. As the issue describes, this is about an interrupt (SIGTERM, Ctrl + C) received during google_compute_network creation: terraform-provider-gcp cannot handle it and leaks Terraform state (the resource is created in the cloud provider but is not present in the Terraform state). According to our testing, when the terraform process is interrupted, it exits right away with:

^C
Interrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...

Stopping operation...
google_compute_network.network: Still creating... [10s elapsed]
╷
│ Error: execution halted
│
│
╵
╷
│ Error: execution halted
│
│
╵
╷
│ Error: Error waiting to create Network: Error waiting for Creating Network: error while retrieving operation: unable to finish polling, context has been cancelled
│
│   with google_compute_network.network,
│   on main.tf line 34, in resource "google_compute_network" "network":
│   34: resource "google_compute_network" "network" {
│
╵

I don't really see how google_compute_network timeouts are involved here. google_compute_network already has high default timeouts:

Timeouts: &schema.ResourceTimeout{
    Create: schema.DefaultTimeout(20 * time.Minute),
    Update: schema.DefaultTimeout(20 * time.Minute),
    Delete: schema.DefaultTimeout(20 * time.Minute),
},

As I already described, when the terraform process is interrupted (SIGTERM), it exits right away with the above error and leaks the state/resource.


ialidzhikov commented Jun 2, 2023

My assumption for the issue, without checking the code: I know that there is an Operation API in GCP, and terraform-provider-gcp presumably does the following: it first creates the resource and then polls the Operation API to check whether the resource has been created. If the resource is added to the state only when the operation succeeds, that would explain this issue. As you can see from the error above, it fails to retrieve the operation for the network creation because the context is already cancelled (due to the SIGTERM).

UPDATE: An issue I opened in the past where failing to get the operation (rate limit exceeded) also leaked the resource - ref #8655.

@ialidzhikov ialidzhikov changed the title r/google_compute_network: Terraform state leaks on interrupt resource/google_compute_network: Terraform state leaks on interrupt Jun 2, 2023

kon-angelo commented Jun 5, 2023

Looking at the code, I think the culprit is more or less here (L263):

err = ComputeOperationWaitTime(
    config, res, project, "Creating Network", userAgent,
    d.Timeout(schema.TimeoutCreate))
if err != nil {
    // The resource didn't actually create
    d.SetId("")
    return fmt.Errorf("Error waiting to create Network: %s", err)
}

The provider basically unsets the ID of the resource if the polling operation fails. In the case of context cancellation, the operation is considered failed, as can be seen from the Error waiting for Creating Network log message.

I think always unsetting the ID of the resource is not helpful: aside from context cancellation, network problems, for example, may cause the operation request to fail. In general, unsetting the ID only makes sense if, in every case where the polling operation fails, you are certain that the resource was not created - which is not the case here.

This also seems to be a generic problem: the same logic is implemented in every resource I inspected.


Also, I am not sure exactly what is gained from unsetting the ID. The subsequent Terraform run will validate the existence of the resource anyway. Isn't it more beneficial to account for the "worst case" scenario - that the resource was created - and allow subsequent runs to manage it without requiring operator intervention? A sketch of what that could look like follows.
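
To illustrate, a minimal sketch (not actual provider code) of how the snippet above could distinguish an interrupted poll from a definitively failed operation. It assumes the errors and context packages are imported, and that ComputeOperationWaitTime wraps the context error rather than flattening it to a string - which would need verifying:

err = ComputeOperationWaitTime(
    config, res, project, "Creating Network", userAgent,
    d.Timeout(schema.TimeoutCreate))
if err != nil {
    if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
        // Polling was cut short; the network may well exist remotely.
        // Keep the ID so the next refresh can reconcile state against
        // the API instead of re-creating and hitting a 409.
        return fmt.Errorf("interrupted while waiting to create Network, resource kept in state: %s", err)
    }
    // Only here is it reasonably certain the operation itself failed.
    d.SetId("")
    return fmt.Errorf("Error waiting to create Network: %s", err)
}

Even this would still drop the resource from state on transient API errors (as in #8655), so an existence check against the API may be needed for full robustness.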

apparentlymart (Contributor) commented:

Hi all! I normally work on Terraform Core, but I'm here today because of an issue about this bug that was opened in the main Terraform Core repository.

I just wanted to confirm that the previous comment seems correct to me: if the request is cancelled then the provider must return a representation of state that is accurate enough for a subsequent plan and apply to succeed, so returning a null object (which is what the SDK does if you set the id to an empty string) should be done only if you are totally sure that no object has been created in the remote system.

In this case it seems like a situation where the remote API requires some polling after the object has been created to wait until it's ready to use. If that's true then a strategy I might suggest is to represent the status of the object as a computed attribute and then if you get cancelled during polling return a new state with the status set to something representing an error.

Then on the next plan you can check during refresh if the object became ready in the meantime anyway, and update the status attribute if so. If the object remains not ready during the planning phase then you could report that the status needs to change to "ready" and mark that change as "Force new" (in the SDK's terminology) so that the provider will have an opportunity to destroy and recreate the object during the next apply.

Other designs are possible too, so I'm sharing the above only as one possible approach to avoid this problem. The general form of this idea is that if an object has been created in the remote system then you should always return an object representing it from the "apply" step. You can optionally return an error at the same time if the object is in an unsalvageable state, in which case Terraform will automatically mark it as "tainted" so it will automatically be planned for replacement on the next plan; or you can handle it more surgically by treating cancellation as a success with the object somehow marked as incomplete, so that the next plan and apply can deal with getting it into a working state.
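
For concreteness, a minimal SDKv2-style sketch of the suggestion above; the creation_status attribute, the waitUntilReady helper, and the resource names are hypothetical illustrations, not existing google provider code:

package example

import (
    "context"

    "github.com/hashicorp/terraform-plugin-sdk/v2/diag"
    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/customdiff"
    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

func resourceExample() *schema.Resource {
    return &schema.Resource{
        CreateContext: resourceExampleCreate,
        ReadContext:   resourceExampleRead,
        Schema: map[string]*schema.Schema{
            "creation_status": { // hypothetical computed status attribute
                Type:     schema.TypeString,
                Computed: true,
            },
        },
        // Plan a destroy/recreate ("Force new") when a previous apply left
        // the object in an error state.
        CustomizeDiff: customdiff.ForceNewIfChange("creation_status",
            func(ctx context.Context, old, new, meta interface{}) bool {
                return old.(string) == "ERROR"
            }),
    }
}

func resourceExampleCreate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
    // ... issue the create request and set the ID as soon as the remote
    // object exists ...
    d.SetId("example-id")

    if err := waitUntilReady(ctx); err != nil {
        if ctx.Err() != nil {
            // Cancelled during polling: return the object with an error
            // status instead of a null object, so state still records it.
            d.Set("creation_status", "ERROR")
            return nil
        }
        // Genuine operation failure: returning an error with the ID still
        // set makes Terraform taint the resource for replacement.
        return diag.FromErr(err)
    }
    d.Set("creation_status", "READY")
    return nil
}

func resourceExampleRead(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
    // On refresh, check whether the object became ready in the meantime
    // and update creation_status accordingly (omitted).
    return nil
}

// waitUntilReady stands in for the provider's operation polling.
func waitUntilReady(ctx context.Context) error { return ctx.Err() }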

@rileykarson rileykarson removed the bug label Oct 2, 2023
@rileykarson rileykarson changed the title resource/google_compute_network: Terraform state leaks on interrupt google provider: Terraform state leaks on interrupt Oct 2, 2023
@rileykarson rileykarson added the persistent-bug Hard to diagnose or long lived bugs for which resolutions are more like feature work than bug work label Oct 2, 2023
rileykarson (Collaborator) commented:

Renamed this to cover resources generally.


rileykarson commented Oct 16, 2023

We always remove the resource from state in generated resources if we get an error when polling, even if the issue is with the polling itself. We should do a better job of identifying cases where the resource may have been created, or consider tainting more often (although that behaviour change across many resources would be too large for a minor version).

Thinking out loud: maybe we could call read even when we get an error, rather than returning immediately - that's probably our most reliable way of telling whether creation succeeded. However, post-creates and similar may pose an issue.
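
A rough sketch of that idea against the generated pattern quoted earlier (hypothetical and untested; resourceComputeNetworkRead is the resource's existing read function, which clears the ID when the API returns a 404):

err = ComputeOperationWaitTime(
    config, res, project, "Creating Network", userAgent,
    d.Timeout(schema.TimeoutCreate))
if err != nil {
    // Before dropping the resource from state, ask the API whether it
    // actually exists.
    if readErr := resourceComputeNetworkRead(d, meta); readErr == nil && d.Id() != "" {
        // It exists: keep it in state and surface the polling error.
        return fmt.Errorf("Network created, but waiting for creation to finish failed: %s", err)
    }
    d.SetId("")
    return fmt.Errorf("Error waiting to create Network: %s", err)
}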

@rileykarson rileykarson added this to the Goals milestone Oct 16, 2023
@github-actions github-actions bot added forward/review In review; remove label to forward service/compute-vpc labels Nov 11, 2023
@roaks3 roaks3 added service/terraform and removed forward/review In review; remove label to forward service/compute-vpc labels Nov 21, 2023