[Bug]: update: Step Functions State Machine (arn:aws:states:XXX:XXX:stateMachine:test_state_machine_1) eventual consistency #34697

Open
ribbonhood opened this issue Dec 2, 2023 · 3 comments
Labels
bug Addresses a defect in current functionality. eventual-consistency Pertains to eventual consistency issues. service/sfn Issues and PRs that pertain to the sfn service.

Comments

ribbonhood commented Dec 2, 2023

Terraform Core Version

1.6.5

AWS Provider Version

5.29.0

Affected Resource(s)

aws_sfn_state_machine

Expected Behavior

The state machine version is updated and the alias points to the new version.

Actual Behavior

State machine update times out and fails.

Relevant Error/Panic Output Snippet

╷
│ Error: waiting for Step Functions State Machine (arn:aws:states:XXX:XXX:stateMachine:test_state_machine_1) update: Step Functions State Machine (arn:aws:states:XXX:XXX:stateMachine:test_state_machine_1) eventual consistency
│ 
│   with aws_sfn_state_machine.state_machine_1,
│   on test_sf.tf line 10, in resource "aws_sfn_state_machine" "state_machine_1":
│   10: resource "aws_sfn_state_machine" "state_machine_1" {
│ 
╵

Terraform Configuration Files

data "template_file" "sf_template" {
  template = file("${path.module}/definition.json.tpl")
}

resource "aws_iam_role" "step-functions-role" {
  name = "test_sf_1"
  assume_role_policy = file("${path.module}/step-functions-role.json")
}

resource "aws_sfn_state_machine" "state_machine_1" {
  name     = "test_state_machine_1"
  role_arn = aws_iam_role.step-functions-role.arn
  publish = true

  logging_configuration {
    include_execution_data = false
  }
  definition = data.template_file.sf_template.rendered

  /*lifecycle {
    replace_triggered_by = [value]
  }*/
  timeouts {
    #create = "5m"
    update = "2m"
  }
}

data "aws_sfn_state_machine_versions" "state_machine_1_versions" {
  statemachine_arn = aws_sfn_state_machine.state_machine_1.arn
}

resource "aws_sfn_alias" "sfn_active_alias" {
  name = "test_state_machine_1_active"

  routing_configuration {
    state_machine_version_arn = element(data.aws_sfn_state_machine_versions.state_machine_1_versions.statemachine_versions, length(data.aws_sfn_state_machine_versions.state_machine_1_versions.statemachine_versions)-1)
    weight                    = 100
  }

  depends_on = [time_sleep.wait_for_step_function]
}


resource "time_sleep" "wait_for_step_function" {
  create_duration = "30s"
  triggers = {
    role = aws_sfn_state_machine.state_machine_1.arn
  }
}

####DEFINITION#####

{
  "Comment": "Test SM",
  "StartAt": "Step 1",
  "States": {
    "Step 1": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "Func1",
        "Payload": {
          "payload.$": "$",
          "token.$": "$$.Task.Token"
        }
      },
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException",
            "Lambda.TooManyRequestsException"
          ],
          "IntervalSeconds": 5,
          "MaxAttempts":2,
          "BackoffRate":2
        }
      ],
      "Next": "Step 2",
      "ResultSelector": {
        "request.$": "$$.Execution.Input",
        "result.$": "$"
      }
    },
    "Step 2": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "Payload.$": "$",
        "FunctionName": "Func2"
      },
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException",
            "Lambda.TooManyRequestsException"
          ],
          "IntervalSeconds": 4,
          "MaxAttempts":1,
          "BackoffRate":1
        }
      ],
      "End": true,
      "TimeoutSeconds": 28800
    }
  },
  "TimeoutSeconds": 86430
}

####ROLE#######

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": [
          "states.amazonaws.com"
        ]
      },
      "Effect": "Allow"
    }
  ]
}

Steps to Reproduce

1. Run terraform apply to create the resources.
2. Run terraform apply again. Even without any changes to the configuration, the update fails.

Debug Output

  http.response.body=
  | {"creationDate":1.701436609239E9,"definition":"{\n  \"Comment\": \"Test SM\",\n  \"StartAt\": \"Step 1\",\n  \"States\": {\n    \"Step 1\": {\n      \"Type\": \"Task\",\n      \"Resource\": \"arn:aws:states:::lambda:invoke.waitForTaskToken\",\n      \"Parameters\": {\n        \"FunctionName\": \"Func1\",\n        \"Payload\": {\n          \"payload.$\": \"$\",\n          \"token.$\": \"$$.Task.Token\"\n        }\n      },\n      \"Retry\": [\n        {\n          \"ErrorEquals\": [\n            \"Lambda.ServiceException\",\n            \"Lambda.AWSLambdaException\",\n            \"Lambda.SdkClientException\",\n            \"Lambda.TooManyRequestsException\"\n          ],\n          \"IntervalSeconds\": 5,\n          \"MaxAttempts\":2,\n          \"BackoffRate\":2\n        }\n      ],\n      \"Next\": \"Step 2\",\n      \"ResultSelector\": {\n        \"request.$\": \"$$.Execution.Input\",\n        \"result.$\": \"$\"\n      }\n    },\n    \"Step 2\": {\n      \"Type\": \"Task\",\n      \"Resource\": \"arn:aws:states:::lambda:invoke\",\n      \"Parameters\": {\n        \"Payload.$\": \"$\",\n        \"FunctionName\": \"Func2\"\n      },\n      \"Retry\": [\n        {\n          \"ErrorEquals\": [\n            \"Lambda.ServiceException\",\n            \"Lambda.AWSLambdaException\",\n            \"Lambda.SdkClientException\",\n            \"Lambda.TooManyRequestsException\"\n          ],\n          \"IntervalSeconds\": 4,\n          \"MaxAttempts\":1,\n          \"BackoffRate\":1\n        }\n      ],\n      \"End\": true,\n      \"TimeoutSeconds\": 28800\n    }\n  },\n  \"TimeoutSeconds\": 
86430\n}","loggingConfiguration":{"__type":"com.amazonaws.swf.base.model#LoggingConfiguration","includeExecutionData":false,"level":"OFF"},"name":"test_state_machine_1","revisionId":"72aa6bea-68f1-4a29-8b0a-c193390a4f96","roleArn":"arn:aws:iam::XXXX:role/test_sf_1","stateMachineArn":"arn:aws:states:XXXX:XXXX:stateMachine:test_state_machine_1","status":"ACTIVE","tracingConfiguration":{"__type":"com.amazonaws.swf.base.model#TracingConfiguration","enabled":false},"type":"STANDARD"}


Panic Output

No response

Important Factoids

No

References

No response

Would you like to implement a fix?

None
ribbonhood added the bug label Dec 2, 2023
github-actions bot added the service/iam and service/sfn labels Dec 2, 2023
terraform-aws-provider bot added the needs-triage label Dec 2, 2023
ribbonhood (Author) commented

After some tinkering, it appears the issue is related to having a logging_configuration block in which level is not explicitly set.

logging_configuration {
    include_execution_data = false
}

When level is not set explicitly, there appears to be a bug that tries to recreate the state machine, and in turn I get this error. With level = "OFF" added explicitly, the state machine is not recreated and updates work as expected.

logging_configuration {
    level = "OFF"
    include_execution_data = false
}

I'll leave this open as it may be an actual bug that needs to be looked into.
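
For anyone applying the same workaround, here is a minimal sketch of the state machine resource from the original report with level set explicitly. It assumes the other resources in that configuration (the IAM role and the template_file data source) are unchanged. The debug output above shows the API returning "level":"OFF", so this only makes the configuration state the value the service already reports.

```hcl
resource "aws_sfn_state_machine" "state_machine_1" {
  name       = "test_state_machine_1"
  role_arn   = aws_iam_role.step-functions-role.arn
  publish    = true
  definition = data.template_file.sf_template.rendered

  logging_configuration {
    # Set explicitly so the configuration matches the "OFF" value
    # returned by the API, instead of leaving level unset.
    level                  = "OFF"
    include_execution_data = false
  }

  timeouts {
    update = "2m"
  }
}
```

With this in place, a second terraform apply with no other changes should report no update for the state machine, if the behavior matches what is described above.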

justinretzolk added the eventual-consistency label and removed the service/iam and needs-triage labels Jan 18, 2024
brainsiq commented Aug 14, 2024

I've had a similar issue, which seemed to be caused by not setting kms_data_key_reuse_period_seconds in encryption_configuration.

Every apply performed an in-place update that changed the value from 300 (the default) to null, and more often than not it produced the same eventual consistency error. Each apply also published a new version (with publish = true), which stopped happening after I added the encryption setting.
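
A rough sketch of what setting the value explicitly might look like. The comment above does not show the actual configuration, so everything here is an assumption for illustration: the resource names, the customer managed key, and the referenced aws_iam_role.example (not defined in this snippet) are all hypothetical. The value 300 is the default mentioned above.

```hcl
# Hypothetical KMS key; substitute the key your state machine actually uses.
resource "aws_kms_key" "example" {
  description = "Example key for Step Functions data encryption"
}

resource "aws_sfn_state_machine" "example" {
  name       = "example_state_machine"
  role_arn   = aws_iam_role.example.arn # hypothetical role, defined elsewhere
  publish    = true
  definition = file("${path.module}/definition.json")

  encryption_configuration {
    type       = "CUSTOMER_MANAGED_KMS_KEY"
    kms_key_id = aws_kms_key.example.arn

    # Stated explicitly so the plan does not try to change 300 to null
    # on every apply.
    kms_data_key_reuse_period_seconds = 300
  }
}
```

The idea is the same as the logging_configuration workaround: state the service-side default in the configuration so that no in-place update is planned.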
