Kill/run test issues #4464

Open
epicfaace opened this issue May 3, 2023 · 0 comments
Labels
p1 Do it in the next two weeks.

Comments

@epicfaace
Member

Kill/run

First, note that the run tests that fail do so because of failures in kill; the examples above show why.

This one perplexes me.

So, in the worker logs for the sharedFS run that failed here, we see the following:

2023-03-22 19:09:32,441 Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.RUNNING to RUN_STAGE.CLEANING_UP. Reason: the bundle was killed /opt/codalab/worker/worker_run_state.py 142
2023-03-22 19:09:34,345 Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.CLEANING_UP to RUN_STAGE.FINALIZING. Reason: Bundle is killed. uuid: 0x468e9f41ecee4108895cec7001e9f01d. failure message: Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota /opt/codalab/worker/worker_run_state.py 142
2023-03-22 19:09:39,384 Got websocket message, got data: fv-az564-273, going to check in now. /opt/codalab/worker/worker.py 316
2023-03-22 19:09:39,689 Connected! Successful check in! /opt/codalab/worker/worker.py 505
2023-03-22 19:09:39,689 Received kill message: {'type': 'kill', 'uuid': '0x468e9f41ecee4108895cec7001e9f01d', 'kill_message': 'Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota'} /opt/codalab/worker/worker.py 525
2023-03-22 19:09:39,689 Received mark_finalized message: {'type': 'mark_finalized', 'uuid': '0x468e9f41ecee4108895cec7001e9f01d'} /opt/codalab/worker/worker.py 525
Note first that 0x468e9f41ecee4108895cec7001e9f01d is the UUID of the bundle for which the failure occurs in the linked GHA test.

From these logs, we see this: Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.RUNNING to RUN_STAGE.CLEANING_UP. Reason: the bundle was killed. This corresponds to the code here, which indicates the kill was indeed received. Moreover, we see Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.CLEANING_UP to RUN_STAGE.FINALIZING. Reason: Bundle is killed., which again confirms the bundle was killed, AND we see the failure message Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota. This corresponds to the code here. Directly after this, the run_state has the new failure message added. Later on, the worker checks in and sends those runs to the server (see here).
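To make that sequence concrete, here is a minimal sketch of the handoff as I understand it -- not the actual CodaLab code; RunStage, RunState, and the function names here are assumptions for illustration:

```python
# Minimal sketch (assumed names, not the real worker code) of the handoff:
# the worker attaches failure_message while moving the run to FINALIZING,
# and the run state is later serialized when the worker checks in.
from typing import NamedTuple, Optional


class RunStage:
    RUNNING = "RUN_STAGE.RUNNING"
    CLEANING_UP = "RUN_STAGE.CLEANING_UP"
    FINALIZING = "RUN_STAGE.FINALIZING"


class RunState(NamedTuple):
    stage: str
    is_killed: bool = False
    failure_message: Optional[str] = None
    exit_code: Optional[int] = None


def transition_cleaning_up(run_state: RunState, kill_message: str) -> RunState:
    # This is the step whose log line appears above: as the run leaves
    # CLEANING_UP, the kill reason is attached as the failure message.
    if run_state.is_killed:
        return run_state._replace(
            stage=RunStage.FINALIZING,
            failure_message=kill_message,
        )
    return run_state._replace(stage=RunStage.FINALIZING)


def check_in_payload(run_state: RunState) -> dict:
    # Later, the worker checks in and sends its runs (including
    # failure_message) up to the rest server.
    return run_state._asdict()
```

If that picture is right, the failure message attached during the CLEANING_UP → FINALIZING transition should still be present in the payload the worker sends at check-in.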

Now, a bundle's terminal state is set here in the rest server. For the terminal state to be READY, the failure_message and exit_code would both have to be None. Therefore, that bundle did not have a failure_message when its status was sent up to the rest server -- even though it must have had a failure message at some point while transitioning to finalizing, since that was logged. (Note: before doing this, the failure_message and exit_code are added to the bundle metadata in the transition_bundle_finalizing function here.)
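For reference, the terminal-state decision being described boils down to something like the following hypothetical sketch (state names and the function signature are assumptions; the real logic lives in the rest server):

```python
# Hypothetical reading of the terminal-state decision: READY only when neither
# a failure_message nor an exit_code was reported; otherwise a failure state.
from typing import Optional


def terminal_state(failure_message: Optional[str], exit_code: Optional[int]) -> str:
    if failure_message is None and exit_code is None:
        return "ready"
    # Otherwise the bundle ends up in a failure state (failed/killed), which is
    # what the test expects but is not what we observe here.
    return "failed"
```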

So, what is happening here? Currently, I'm not sure. I have logged worker_run.as_dict in the PR for the fix to the tests, so the next kill failure should show us what the worker_run dict actually contains -- but I'm genuinely unsure what's going on. We could also try setting exclude_final_states to True when calling wait_until_state and hope that the bundle eventually reaches the KILLED state, but that makes me uncomfortable.
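For concreteness, the two options look roughly like this (worker_run.as_dict, wait_until_state, and exclude_final_states are the names from above; everything else is assumed scaffolding, not the real code):

```python
# Rough illustration of the two options, with assumed surrounding code.
import logging

logger = logging.getLogger(__name__)


def on_worker_checkin(worker_run) -> None:
    # Option 1 (done in the PR fixing the tests): log exactly what the worker
    # sent up, so the next kill failure shows whether failure_message made it
    # into the payload.
    logger.info("worker_run: %s", worker_run.as_dict)


# Option 2 (test-side workaround): keep waiting past other terminal states and
# hope the bundle eventually reaches KILLED, e.g. something like:
#   wait_until_state(bundle_uuid, State.KILLED, exclude_final_states=True)
# This papers over the underlying race rather than explaining it, hence the
# discomfort.
```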

@epicfaace epicfaace added the p1 Do it in the next two weeks. label May 3, 2023
@epicfaace epicfaace added p2 Do it this quarter. and removed p1 Do it in the next two weeks. labels Jun 14, 2023
@epicfaace epicfaace removed their assignment Jun 14, 2023
@epicfaace epicfaace added p1 Do it in the next two weeks. and removed p2 Do it this quarter. labels Jun 14, 2023