Skip to content

patch cancel bug: immediately mark predictions as cancelled #1798

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 22, 2024

Conversation

technillogue
Copy link
Contributor

unconditionally mark predictions as cancelled without waiting for cancellation to succeed

while I struggle with #1786, we see queue spikes because not responding to cancellation promptly causes the pod to get restarted. this is a dirty hack to pretend like cancellation works immediately. as soon as we fix the race condition (and possibly any issues with task.cancel() behaving differently from signal handlers), we can drop this.

this is implemented by pretending as though the right event was read from the pipe

@technillogue technillogue requested a review from a team July 11, 2024 23:40
@technillogue technillogue force-pushed the syl/patch-cancel-bug branch 2 times, most recently from e3d1c78 to df1d8d0 Compare July 15, 2024 04:19
@nickstenning
Copy link
Contributor

I'm not sure how I feel about this, fwiw. By doing this we tell director that work has been stopped when it may not have been, and it seems to me there's a real risk that opens the door to errors later on that may be much harder to debug.

If we know for a fact that's not the case for the specific models we're trying to patch, is there a way we can address this in the model? Or only turn it on for those models?

@technillogue
Copy link
Contributor Author

it's definitely not possible to patch it in the model, and I think it should occur for any sufficiently high-throughput model (but we only had one or two of those). we can definitely gate this behind a feature flag, though

…cellation to succeed

while I struggle with #1786, we see queue spikes because not responding to cancellation promptly causes the pod to get restarted. this is a dirty hack to pretend like cancellation works immediately. as soon as we fix the race condition (and possibly any issues with task.cancel() behaving differently from signal handlers), we can drop this.

Signed-off-by: technillogue <technillogue@gmail.com>
this makes the cancel patch cleaner, but might be a small speedup for high-throughput outputs

Signed-off-by: technillogue <technillogue@gmail.com>
@technillogue technillogue force-pushed the syl/patch-cancel-bug branch from 568552a to db8542e Compare July 22, 2024 19:16
Copy link
Contributor

@mattt mattt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting that behind a flaggable envvar. Interested to see how this works in practice.

@technillogue technillogue merged commit 5464c43 into async Jul 22, 2024
9 checks passed
@technillogue technillogue deleted the syl/patch-cancel-bug branch July 22, 2024 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants