Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support more path syntax for successCondition and failureCondition on resource templates #7143

Closed
jazka opened this issue Nov 3, 2021 · 11 comments
Labels
area/templates/resource solution/duplicate This issue or PR is a duplicate of an existing one type/feature Feature request

Comments

@jazka
Copy link

jazka commented Nov 3, 2021

Summary

Support more syntax for successCondition and failureCondition like gjson. The guide doc tells us the two fields use kubernetes label selection syntax and can be applied against any field of the resource (not just labels). But how to check the value of the field in set?

Use Cases

For PytorchJob, the status section will be used to check the job status. That means the successCondition should be set like successCondition=status.conditions.#(type="Created").status="True", but this syntax is not supported. Any way to deal with the case with array struct?

 status:
    completionTime: 2019-01-11T01:03:15Z
    conditions:
    - lastTransitionTime: 2019-01-11T00:51:48Z
      lastUpdateTime: 2019-01-11T00:51:48Z
      message: PyTorchJob pytorch-dist-mnist-gloo is created.
      reason: PyTorchJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: 2019-01-11T00:57:22Z
      lastUpdateTime: 2019-01-11T00:57:22Z
      message: PyTorchJob pytorch-dist-mnist-gloo is running.
      reason: PyTorchJobRunning
      status: "False"
      type: Running
    - lastTransitionTime: 2019-01-11T01:03:15Z
      lastUpdateTime: 2019-01-11T01:03:15Z
      message: PyTorchJob pytorch-dist-mnist-gloo is successfully completed.
      reason: PyTorchJobSucceeded
      status: "True"
      type: Succeeded
    replicaStatuses:
      Master:
        succeeded: 1
      Worker:
        succeeded: 1
    startTime: 2019-01-11T00:57:22Z

Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

@jazka jazka added the type/feature Feature request label Nov 3, 2021
@terrytangyuan
Copy link
Member

status.replicaStatuses.Worker.succeeded > 1

@terrytangyuan
Copy link
Member

BTW, looks like you are using older versions of pytorch-operator in Kubeflow. I'd suggest trying out the latest version of https://github.com/kubeflow/training-operator?

@jazka
Copy link
Author

jazka commented Nov 3, 2021

Thanks for your response so quickly.

Yes, we used the training-operator.

It works with status.replicaStatuses.Worker.succeeded > 1 for successCondition, but how to set failureCondition? I think it will wrong to check status.replicaStatuses.*.failed count, since it may failed by any master or any worker, and the master and worker could be set restarted policy.

@terrytangyuan
Copy link
Member

That will be up to your training logic, e.g. whether you've enabled elastic training.

gjson is actually used underneath:

func matchConditions(jsonBytes []byte, successReqs labels.Requirements, failReqs labels.Requirements) (bool, error) {
ls := gjsonLabels{json: jsonBytes}
for _, req := range failReqs {
failed := req.Matches(ls)
msg := fmt.Sprintf("failure condition '%s' evaluated %v", req, failed)
log.Infof(msg)
if failed {
// We return false here to not retry when failure conditions met.
return false, errors.Errorf(errors.CodeBadRequest, msg)
}
}
numMatched := 0
for _, req := range successReqs {
matched := req.Matches(ls)
log.Infof("success condition '%s' evaluated %v", req, matched)
if matched {
numMatched++
}
}
log.Infof("%d/%d success conditions matched", numMatched, len(successReqs))
if numMatched >= len(successReqs) {
return false, nil
}
return true, errors.Errorf(errors.CodeNotFound, "Neither success condition nor the failure condition has been matched. Retrying...")
}

You may play with the unit test to test different conditions locally:

func TestResourceConditionsMatching(t *testing.T) {
var successReqs labels.Requirements
successSelector, err := labels.Parse("status.phase == Succeeded")
assert.NoError(t, err)
successReqs, _ = successSelector.Requirements()
assert.NoError(t, err)
var failReqs labels.Requirements
failSelector, err := labels.Parse("status.phase == Error")
assert.NoError(t, err)
failReqs, _ = failSelector.Requirements()
assert.NoError(t, err)
jsonBytes := []byte(`{"name": "test","status":{"phase":"Error"}`)
finished, err := matchConditions(jsonBytes, successReqs, failReqs)
assert.Error(t, err, `failure condition '{status.phase == [Error]}' evaluated true`)
assert.False(t, finished)
jsonBytes = []byte(`{"name": "test","status":{"phase":"Succeeded"}`)
finished, err = matchConditions(jsonBytes, successReqs, failReqs)
assert.NoError(t, err)
assert.False(t, finished)
jsonBytes = []byte(`{"name": "test","status":{"phase":"Pending"}`)
finished, err = matchConditions(jsonBytes, successReqs, failReqs)
assert.Error(t, err, "Neither success condition nor the failure condition has been matched. Retrying...")
assert.True(t, finished)
}

@jazka
Copy link
Author

jazka commented Nov 3, 2021

I checked the source code of PytorchJob and I think it is more accurate by monitor status.conditions. Also we did some tests with status.replicaStatuses and there is no suitable method to set failureCondition.

I also read the source code of argo-workflow and notice gjson is used. But the path syntax supported by gjson will be refused by the following logic at the beginning:

var failReqs labels.Requirements
if we.Template.Resource.FailureCondition != "" {
failSelector, err := labels.Parse(we.Template.Resource.FailureCondition)
if err != nil {
return errors.Errorf(errors.CodeBadRequest, "fail condition '%s' failed to parse: %v", we.Template.Resource.FailureCondition, err)
}
log.Infof("Failing for conditions: %s", failSelector)
failReqs, _ = failSelector.Requirements()
}

It refers to:
https://github.com/kubernetes/apimachinery/blob/master/pkg/labels/selector.go#L891-L896 and https://github.com/kubernetes/apimachinery/blob/master/pkg/util/validation/validation.go

@terrytangyuan
Copy link
Member

Would you like to submit a PR to support the additional syntax or make it fully compatible with gjson?

@jazka
Copy link
Author

jazka commented Nov 4, 2021

Would you like to submit a PR to support the additional syntax or make it fully compatible with gjson?

Sure, I could try it.

BTW, the description of failureCondition should be updated on official docs . The default label selector syntax is AND, but actually its syntax is OR for failureCondition.
image

This is very misleading if user does not check the code here.

func matchConditions(jsonBytes []byte, successReqs labels.Requirements, failReqs labels.Requirements) (bool, error) {
ls := gjsonLabels{json: jsonBytes}
for _, req := range failReqs {
failed := req.Matches(ls)
msg := fmt.Sprintf("failure condition '%s' evaluated %v", req, failed)
log.Infof(msg)
if failed {
// We return false here to not retry when failure conditions met.
return false, errors.Errorf(errors.CodeBadRequest, msg)
}
}
numMatched := 0
for _, req := range successReqs {
matched := req.Matches(ls)
log.Infof("success condition '%s' evaluated %v", req, matched)
if matched {
numMatched++
}
}
log.Infof("%d/%d success conditions matched", numMatched, len(successReqs))
if numMatched >= len(successReqs) {
return false, nil
}
return true, errors.Errorf(errors.CodeNotFound, "Neither success condition nor the failure condition has been matched. Retrying...")
}
.

@terrytangyuan
Copy link
Member

Feel free to update the docs along the way as well.

@whybeyoung
Copy link
Contributor

the same problem here. failureCondition should support array struct here. For example, when set tfjob pod restart policy to
OnFailure and set a backoffLimit:3.
now if exceed the backoffLimit 3, job failed ,like this
image

now, failureCondition can't support this array struct .
BTW, we can also change operator code.....but i thinks we should. support here , in argo is more useful

@whybeyoung
Copy link
Contributor

according gjson , and my try......

use this fix my problem:

image
like gjson doc say..
https://github.com/tidwall/gjson

However ,some complex syntax will be conflict with label selector syntax..
BTW, do we really need label selector syntax? i thinks jsonPath is better.. @terrytangyuan

@agilgur5 agilgur5 changed the title Support more path syntax for successCondition and failureCondition Support more path syntax for successCondition and failureCondition on resource templates Oct 18, 2024
@agilgur5
Copy link
Member

This seems to have been a duplicate of #2288 which has a decent bit more upvotes. The discussions don't exactly overlap though

@agilgur5 agilgur5 added the solution/duplicate This issue or PR is a duplicate of an existing one label Oct 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/templates/resource solution/duplicate This issue or PR is a duplicate of an existing one type/feature Feature request
Projects
None yet
Development

No branches or pull requests

5 participants