
Conversation

@bparees
Contributor

@bparees bparees commented Mar 2, 2015

For now this will retry retryable failures forever in a tight loop. I'm still contemplating capping it at something like 100. The problem is that without backoff (which the current retrycontroller logic deliberately does not offer), if there's a temporary API outage we'll burn through any reasonable number of retries rapidly.
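To illustrate the concern, here is a minimal sketch of a capped, no-backoff retry loop (hypothetical names, not the actual retrycontroller code). During a temporary outage every attempt fails immediately, so even a generous cap is exhausted almost instantly:

```go
package main

import (
	"errors"
	"fmt"
)

// retry runs fn up to maxAttempts times; maxAttempts < 0 means retry forever.
// There is deliberately no sleep between attempts: this is the "tight loop",
// so all attempts can land inside the same outage window.
func retry(maxAttempts int, fn func() error) error {
	var err error
	for i := 0; maxAttempts < 0 || i < maxAttempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	attempts := 0
	err := retry(100, func() error {
		attempts++
		return errors.New("api endpoint unavailable") // simulated outage
	})
	fmt.Printf("gave up after %d attempts: %v\n", attempts, err)
}
```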

@bparees
Contributor Author

bparees commented Mar 2, 2015

[test]

@bparees bparees changed the title retry build errors [WIP] retry build errors Mar 2, 2015
@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_openshift3/1238/)

Contributor

I don't think this message belongs here - retry decisions belong in a higher layer.

Contributor Author

The controller is the bit that has the logic to know whether a particular event can or should be retried (based on what state has been modified, or would be duplicated, if the event is replayed), so I don't see how you can (well, should) move the retry decision out of the controller. The controller is already making the decision to retry or not based on the type of error it returns, and it knows the logic of the retry controller anyway. There's really no avoiding the fact that the controller needs some knowledge of the error types the retry handler expects and how it will handle them; the less of that there is, the better, IMHO. (Which is why I preferred the approach that just returned RetryableError vs. FatalError wrapping the real error.)
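A rough sketch of that preferred "wrap the real error" approach (the type names here are illustrative, not necessarily the merged API): the controller classifies its own failures, and the retry handler only inspects the wrapper type, never the underlying error.

```go
package build

import "fmt"

// RetryableError signals that replaying the event is safe.
type RetryableError struct{ Err error }

func (e RetryableError) Error() string { return fmt.Sprintf("retryable: %v", e.Err) }

// FatalError signals that replaying the event would duplicate state.
type FatalError struct{ Err error }

func (e FatalError) Error() string { return fmt.Sprintf("fatal: %v", e.Err) }

// shouldRetry is all a retry handler needs to call; it requires no
// knowledge of the controller's internals beyond the two wrapper types.
func shouldRetry(err error) bool {
	_, ok := err.(RetryableError)
	return ok
}
```

The design trade-off being debated: the controller must know which errors are replay-safe either way, so wrapping keeps that knowledge in one place instead of leaking error-classification rules into a higher layer.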

Contributor

Higher layer = your retry handler. That's where you make the decision about retrying. Adding it here too confuses the issue.

@bparees bparees force-pushed the build_errors branch 4 times, most recently from 657c533 to 482ca27 Compare March 4, 2015 19:21
@bparees bparees changed the title [WIP] retry build errors retry build errors Mar 4, 2015
@bparees
Copy link
Contributor Author

bparees commented Mar 4, 2015

@smarterclayton updated per your comments. Thoughts? Notable and debatable design points:

  1. Retryable errors are retried forever (max attempts == -1) in a tight loop.
  • I'm not averse to reducing this to 10 or 100, but since it's a tight loop, my guess is that anything that fails once will fail N times, because the error condition won't be corrected fast enough for a later attempt to have a chance of succeeding (e.g., the API endpoint becoming available again).
  2. All errors (retried or not) go through kutil.HandleError.
  • If something is spinning in a retry loop, it seems important to make that apparent in the logs, particularly since we're not going to eventually perm-fail it (see point 1). A sketch of how these two points fit together follows below.
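Here is a hypothetical sketch of points 1 and 2 combined (handleError stands in for the real kutil.HandleError; the handler shape is illustrative): every failure is surfaced in the logs before the requeue decision, so a spinning retry loop is visible rather than silent.

```go
package build

import "log"

// handleError mimics kutil.HandleError: every error is logged, even ones
// that will be retried, so an endless retry loop shows up in the logs.
func handleError(err error) {
	log.Printf("error (will be evaluated for retry): %v", err)
}

// retryHandler decides whether a failed event should be requeued.
type retryHandler struct {
	maxRetries int // -1 == retry forever
}

func (r retryHandler) onError(err error, retries int) bool {
	handleError(err)
	if r.maxRetries < 0 {
		return true // retry forever
	}
	return retries < r.maxRetries
}
```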

Contributor

Godoc

@smarterclayton
Contributor

Minor comment, otherwise this is fine.

@bparees
Contributor Author

bparees commented Mar 4, 2015

godoc'd. [merge]

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_openshift3/1115/) (Image: devenv-fedora_973)

@openshift-bot
Contributor

Evaluated for origin up to 2841124

@smarterclayton
Contributor

post-hoc LGTM


openshift-bot pushed a commit that referenced this pull request Mar 4, 2015
@openshift-bot openshift-bot merged commit 4cc3962 into openshift:master Mar 4, 2015
@bparees bparees deleted the build_errors branch March 5, 2015 15:28
Miciah pushed a commit to Miciah/origin that referenced this pull request Jun 27, 2018
UPSTREAM: 59931: do not delete node in openstack, if those still exist in cloudprovider