Add retry logic to UCP GetAWSResourceWithPost handler #8170

Merged (8 commits) on Jan 27, 2025

Conversation

willdavsmith
Contributor

@willdavsmith willdavsmith commented Dec 26, 2024

Description

We've seen flaky functional test failures with AWS S3: #5963

This PR adds retries to the handler that I think is causing this 404 error.

  • Add pkg/retry directory for standard retries
  • Use pkg/retry in UCP GetAWSResourceWithPost handler

Type of change

  • This pull request fixes a bug in Radius and has an approved issue (issue link required).
  • This pull request adds or changes features of Radius and has an approved issue (issue link required).
  • This pull request is a minor refactor, code cleanup, test improvement, or other maintenance task and doesn't change the functionality of Radius (issue link optional).

Fixes: #7352

Contributor checklist

Please verify that the PR meets the following requirements, where applicable:

  • An overview of proposed schema changes is included in a linked GitHub issue.
  • A design document PR is created in the design-notes repository, if new APIs are being introduced.
  • If applicable, design document has been reviewed and approved by Radius maintainers/approvers.
  • A PR for the samples repository is created, if existing samples are affected by the changes in this PR.
  • A PR for the documentation repository is created, if the changes in this PR affect the documentation or any user facing updates are made.
  • A PR for the recipes repository is created, if existing recipes are affected by the changes in this PR.

Signed-off-by: willdavsmith <willdavsmith@gmail.com>
Signed-off-by: willdavsmith <willdavsmith@gmail.com>

codecov bot commented Dec 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.01%. Comparing base (a1782fc) to head (77fba24).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8170      +/-   ##
==========================================
+ Coverage   59.95%   60.01%   +0.05%     
==========================================
  Files         590      591       +1     
  Lines       39513    39554      +41     
==========================================
+ Hits        23690    23737      +47     
+ Misses      14058    14054       -4     
+ Partials     1765     1763       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


radius-functional-tests bot commented Dec 27, 2024

Radius functional test overview

🔍 Go to test action run

Name Value
Repository willdavsmith/radius
Commit ref 395ecce
Unique ID funcd8bba2bb19
Image tag pr-funcd8bba2bb19
Tools in the current test run:
  • gotestsum 1.12.0
  • KinD: v0.20.0
  • Dapr:
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.3.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-funcd8bba2bb19
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-funcd8bba2bb19
  • dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-funcd8bba2bb19
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-funcd8bba2bb19
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-funcd8bba2bb19
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting corerp-cloud functional tests...
⌛ Starting ucp-cloud functional tests...
✅ corerp-cloud functional tests succeeded
✅ ucp-cloud functional tests succeeded

@@ -139,6 +139,7 @@ require (
github.com/sagikazarmark/locafero v0.6.0 // indirect
github.com/sagikazarmark/slog-shim v0.1.0 // indirect
github.com/sergi/go-diff v1.3.2-0.20230802210424-5b0b94c5c0d3 // indirect
github.com/sethvargo/go-retry v0.3.0 // indirect
Contributor Author

I added this package to simplify our retry logic across the project. It looks well tested and has no dependencies, so I think it is a good choice. Let's discuss in this PR.

Contributor

Last commit seems to be from 6 months ago. Just wondering if that could be an issue.

Contributor Author

I think it should be okay. The code is straightforward and has no dependencies, so hopefully it won't need many new commits.

}

// NewNoOpRetryer creates a new Retryer that does not retry.
func NewNoOpRetryer() *Retryer {
Contributor Author

This is useful for testing. Using this retryer should give the same behavior we have today.

@@ -125,7 +125,6 @@ func (p *CreateOrUpdateAWSResource) Run(ctx context.Context, w http.ResponseWrit

if existing {
// Get resource type schema

Contributor Author

extra space

@@ -87,7 +87,7 @@ func (p *CreateOrUpdateAWSResource) Run(ctx context.Context, w http.ResponseWrit
}

cloudControlOpts := []func(*cloudcontrol.Options){CloudControlRegionOption(region)}
cloudFormationOpts := []func(*cloudformation.Options){CloudFormationWithRegionOption(region)}
Contributor Author

I renamed this function to match the cloudcontrol version

@@ -74,7 +75,7 @@ func Test_GetAWSResourceWithPost(t *testing.T) {
CloudControl: testOptions.AWSCloudControlClient,
CloudFormation: testOptions.AWSCloudFormationClient,
}
- awsController, err := NewGetAWSResourceWithPost(armrpc_controller.Options{DatabaseClient: testOptions.DatabaseClient}, awsClients)
+ awsController, err := NewGetAWSResourceWithPost(armrpc_controller.Options{DatabaseClient: testOptions.DatabaseClient}, awsClients, retry.NewNoOpRetryer())
Contributor Author

this should match the functionality we have today, i.e. these tests should pass with no other changes.

}

// NewGetAWSResourceWithPost creates a new GetAWSResourceWithPost controller with the given options and AWS clients.
- func NewGetAWSResourceWithPost(opts armrpc_controller.Options, awsClients ucpaws.Clients) (armrpc_controller.Controller, error) {
+ func NewGetAWSResourceWithPost(opts armrpc_controller.Options, awsClients ucpaws.Clients, retryer *retry.Retryer) (armrpc_controller.Controller, error) {
Contributor Author

we could consider adding the retryer to the other ucp awsproxy routes too, either in the future or this PR. I wanted to get some feedback first
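
The constructor change above passes the retry policy in as a dependency. A rough, self-contained sketch of that pattern follows; `RunFunc`, `once`, `upTo`, `getController`, and `flakyLookup` are hypothetical names invented for this example, and the lookup function merely stands in for the AWS call.

```go
package main

import (
	"errors"
	"fmt"
)

// RunFunc is a hypothetical abstraction: "run fn under some retry policy".
type RunFunc func(fn func() error) error

// once runs fn a single time, like a no-op retryer in unit tests.
func once(fn func() error) error { return fn() }

// upTo returns a policy that tries fn at most n times (no delay, for brevity).
func upTo(n int) RunFunc {
	return func(fn func() error) error {
		var err error
		for i := 0; i < n; i++ {
			if err = fn(); err == nil {
				return nil
			}
		}
		return err
	}
}

// getController stands in for the GET handler; lookup stands in for the AWS call.
type getController struct {
	retry  RunFunc
	lookup func() (string, error)
}

// newGetController receives its retry policy from the caller instead of
// constructing one, so production code and tests can pass different policies.
func newGetController(retry RunFunc, lookup func() (string, error)) (*getController, error) {
	if retry == nil || lookup == nil {
		return nil, errors.New("retry policy and lookup are required")
	}
	return &getController{retry: retry, lookup: lookup}, nil
}

// Get performs the lookup under the injected retry policy.
func (c *getController) Get() (string, error) {
	var out string
	err := c.retry(func() error {
		var lookupErr error
		out, lookupErr = c.lookup()
		return lookupErr
	})
	return out, err
}

// flakyLookup fails the first `failures` calls, then succeeds.
func flakyLookup(failures int) func() (string, error) {
	calls := 0
	return func() (string, error) {
		calls++
		if calls <= failures {
			return "", errors.New("404 not found")
		}
		return "bucket-properties", nil
	}
}

func main() {
	strict, _ := newGetController(once, flakyLookup(2))
	_, err := strict.Get()
	fmt.Println("no retries:", err)

	resilient, _ := newGetController(upTo(3), flakyLookup(2))
	out, err := resilient.Get()
	fmt.Println("with retries:", out, err)
}
```

With this shape, extending retries to other routes would mostly be a matter of passing the same policy to their constructors.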

Contributor

rynowak commented Dec 28, 2024

I'm sure this is written up somewhere already, but I'd like to understand the failure pattern that we're addressing with this change.

Is it something like this? (please help me fill in the blanks)

  • A PUT operation is initiated.
  • The PUT operation succeeds asynchronously.
  • A GET operation against the same resource then fails with a 404.
  • At some point in the future, the same GET operation will succeed if retried, without initiating any additional operations.

The background context is that any multi-regional control plane is eventually consistent. Azure/ARM/Bicep has a similar eventually consistent behavior underneath it (see notes above), and it's mostly hidden from users via the deployment engine.

@willdavsmith
Contributor Author

> I'm sure this is written up somewhere already, but I'd like to understand the failure pattern that we're addressing with this change.
>
> Is it something like this? (please help me fill in the blanks)
>
>   • A PUT operation is initiated.
>   • The PUT operation succeeds asynchronously.
>   • A GET operation against the same resource then fails with a 404.
>   • At some point in the future, the same GET operation will succeed if retried, without initiating any additional operations.
>
> The background context is that any multi-regional control plane is eventually consistent. Azure/ARM/Bicep has a similar eventually consistent behavior underneath it (see notes above), and it's mostly hidden from users via the deployment engine.

This is exactly what I think is happening. In the cases I investigated, the resource had actually been created, but this route returned a 404 during deployment. My understanding is that the deployment engine (DE) performs a PUT, monitors the operation, and then does a GET at the end. That GET calls UCP (the GetAWSResourceWithPost handler), which returns a 404 because AWS reports that the resource doesn't exist yet. My hope is that adding retries here will make this situation more reliable without too much overhead. We can verify the fix if we see this issue less often in the future; until then, both the diagnosis and the solution are an educated guess.

}

return &Retryer{
config: retryConfig,
Contributor

Can retryConfig ever be empty? For example, config is not nil but config.BackoffStrategy is?

Contributor

Is that an okay case?

Contributor Author

good find - I changed the code to not allow this case. now the retryer will use the default configuration in this case. I also added a test

Comment on lines +42 to +45
func TestNewRetryer(t *testing.T) {
config := &RetryConfig{
BackoffStrategy: goretry.NewConstant(1 * time.Second),
}
Contributor

We can test this with a RetryConfig that has a nil BackoffStrategy.

Contributor Author

added a test

Signed-off-by: willdavsmith <willdavsmith@gmail.com>
Signed-off-by: willdavsmith <willdavsmith@gmail.com>

radius-functional-tests bot commented Jan 27, 2025

Radius functional test overview

🔍 Go to test action run

Name Value
Repository willdavsmith/radius
Commit ref 9931f87
Unique ID func1a65c2612f
Image tag pr-func1a65c2612f
Tools in the current test run:
  • gotestsum 1.12.0
  • KinD: v0.20.0
  • Dapr:
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.3.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-func1a65c2612f
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-func1a65c2612f
  • dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-func1a65c2612f
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-func1a65c2612f
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-func1a65c2612f
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting corerp-cloud functional tests...
⌛ Starting ucp-cloud functional tests...
✅ ucp-cloud functional tests succeeded
✅ corerp-cloud functional tests succeeded

@willdavsmith willdavsmith requested a review from ytimocin January 27, 2025 19:14

radius-functional-tests bot commented Jan 27, 2025

Radius functional test overview

🔍 Go to test action run

Name Value
Repository willdavsmith/radius
Commit ref 3850ee9
Unique ID funcc4ef32c5e8
Image tag pr-funcc4ef32c5e8
Tools in the current test run:
  • gotestsum 1.12.0
  • KinD: v0.20.0
  • Dapr:
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.3.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-funcc4ef32c5e8
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-funcc4ef32c5e8
  • dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-funcc4ef32c5e8
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-funcc4ef32c5e8
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-funcc4ef32c5e8
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting corerp-cloud functional tests...
⌛ Starting ucp-cloud functional tests...
✅ ucp-cloud functional tests succeeded
✅ corerp-cloud functional tests succeeded


radius-functional-tests bot commented Jan 27, 2025

Radius functional test overview

🔍 Go to test action run

Name Value
Repository willdavsmith/radius
Commit ref 77fba24
Unique ID funcc8185c93f0
Image tag pr-funcc8185c93f0
Tools in the current test run:
  • gotestsum 1.12.0
  • KinD: v0.20.0
  • Dapr:
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.3.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-funcc8185c93f0
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-funcc8185c93f0
  • dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-funcc8185c93f0
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-funcc8185c93f0
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-funcc8185c93f0
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting ucp-cloud functional tests...
⌛ Starting corerp-cloud functional tests...
✅ ucp-cloud functional tests succeeded
✅ corerp-cloud functional tests succeeded

@willdavsmith willdavsmith merged commit e4db991 into radius-project:main Jan 27, 2025
29 checks passed
Development

Successfully merging this pull request may close these issues.

Add resiliency to the GET operation of AWS resource deployments
3 participants