Skip to content

Conversation

@clumsy
Copy link
Collaborator

@clumsy clumsy commented Oct 31, 2025

Make k8s scheduler cancel abort the job, instead of removing it completely.

Adding delete command for all schedulers to override, default implementation calls cancel to mimic the current behavior.

Test plan:
[x] added unit tests

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 31, 2025
@clumsy
Copy link
Collaborator Author

clumsy commented Oct 31, 2025

Please check this proposal, @kiukchung
Adding -d to cancel is not the best option I think since we either break all custom schedulers or we need to use introspection.

Let me know if you want to limit the scope to just change the behavior of cancel for k8s.

@codecov-commenter
Copy link

codecov-commenter commented Oct 31, 2025

Codecov Report

❌ Patch coverage is 89.28571% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.58%. Comparing base (1e0492c) to head (3dbf4b4).

Files with missing lines Patch % Lines
torchx/runner/api.py 83.33% 1 Missing ⚠️
torchx/schedulers/api.py 50.00% 1 Missing ⚠️
torchx/schedulers/kubernetes_scheduler.py 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1158      +/-   ##
==========================================
- Coverage   91.62%   91.58%   -0.05%     
==========================================
  Files          84       85       +1     
  Lines        6589     6617      +28     
==========================================
+ Hits         6037     6060      +23     
- Misses        552      557       +5     
Flag Coverage Δ
unittests 91.58% <89.28%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@clumsy
Copy link
Collaborator Author

clumsy commented Oct 31, 2025

Some testing evidence @kiukchung

  • torchx run ...:
...
kubernetes://torchx/some-namespace:some-app
torchx 2025-10-31 15:29:30 INFO     Launched app: kubernetes://torchx/some-namespace:some-app
torchx 2025-10-31 15:29:30 INFO     AppStatus:
    State: UNKNOWN
    Num Restarts: -1
    Roles: 
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: None
  • torchx status kubernetes://torchx/some-namespace:some-app:
AppStatus:
    State: PENDING
    Num Restarts: -1
    Roles: 
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: None
  • torchx cancel kubernetes://torchx/some-namespace:some-app:
  • torchx status kubernetes://torchx/some-namespace:some-app: (Looks like k8s:Aborted is already mapped to CANCELLED!)
AppStatus:
    State: CANCELLED
    Num Restarts: -1
    Roles: 
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: None
  • torchx delete kubernetes://torchx/some-namespace:some-app:
  • torchx status kubernetes://torchx/some-namespace:some-app:
torchx 2025-10-31 15:45:43 ERROR    AppDef: some-namespace:some-app, does not exist or has been removed from kubernetes's data plane
  • torchx delete kubernetes://torchx/some-namespace:some-app:
  • torchx cancel kubernetes://torchx/some-namespace:some-app:

If the above looks good I guess the only question is if we want to print anything when we try to delete and do it, or skip if it does not exists.

@clumsy clumsy force-pushed the feat/k8s_cancel_abort branch from a108deb to dd1fbec Compare October 31, 2025 15:49
@meta-codesync
Copy link

meta-codesync bot commented Oct 31, 2025

@kiukchung has imported this pull request. If you are a Meta employee, you can view this in D85961055.

@clumsy clumsy force-pushed the feat/k8s_cancel_abort branch from dd1fbec to 3dbf4b4 Compare October 31, 2025 18:37
@clumsy
Copy link
Collaborator Author

clumsy commented Oct 31, 2025

Updated the docstring, @kiukchung

@clumsy
Copy link
Collaborator Author

clumsy commented Oct 31, 2025

Hm, that Meta Internal-Only Changes Check is failing again, did I do something wrong, @kiukchung ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants