Kill/run test issues #4464

Open
epicfaace opened this issue May 3, 2023 · 0 comments
Labels
p1 Do it in the next two weeks.

Comments

@epicfaace
Member

Kill/run

First, note that the run tests that fail do so because of failures in kill; the examples above show why.

This one perplexes me.

So, in the worker logs for the sharedFS run that failed here, we see the following:

2023-03-22 19:09:32,441 Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.RUNNING to RUN_STAGE.CLEANING_UP. Reason: the bundle was killed /opt/codalab/worker/worker_run_state.py 142
2023-03-22 19:09:34,345 Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.CLEANING_UP to RUN_STAGE.FINALIZING. Reason: Bundle is killed. uuid: 0x468e9f41ecee4108895cec7001e9f01d. failure message: Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota /opt/codalab/worker/worker_run_state.py 142
2023-03-22 19:09:39,384 Got websocket message, got data: fv-az564-273, going to check in now. /opt/codalab/worker/worker.py 316
2023-03-22 19:09:39,689 Connected! Successful check in! /opt/codalab/worker/worker.py 505
2023-03-22 19:09:39,689 Received kill message: {'type': 'kill', 'uuid': '0x468e9f41ecee4108895cec7001e9f01d', 'kill_message': 'Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota'} /opt/codalab/worker/worker.py 525
2023-03-22 19:09:39,689 Received mark_finalized message: {'type': 'mark_finalized', 'uuid': '0x468e9f41ecee4108895cec7001e9f01d'} /opt/codalab/worker/worker.py 525
Note first that 0x468e9f41ecee4108895cec7001e9f01d is the UUID of the bundle for which the failure occurs in the linked GHA test.

From these logs, we see this: Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.RUNNING to RUN_STAGE.CLEANING_UP. Reason: the bundle was killed. This corresponds to the code here, which indicates the kill was indeed received. Moreover, we see Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.CLEANING_UP to RUN_STAGE.FINALIZING. Reason: Bundle is killed., which again confirms the bundle was killed, AND we see the failure message Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota. This corresponds to the code here. Directly after this, the run_state has the new failure message added. Later on, the worker checks in and sends those runs to the server (see here).
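To make that sequence concrete, here is a minimal sketch of the handoff as I understand it -- not the actual CodaLab code; RunStage, RunState, and the function names here are assumptions for illustration:

```python
# Minimal sketch (assumed names, not the real worker code) of the handoff:
# the worker attaches failure_message while moving the run to FINALIZING,
# and the run state is later serialized when the worker checks in.
from typing import NamedTuple, Optional


class RunStage:
    RUNNING = "RUN_STAGE.RUNNING"
    CLEANING_UP = "RUN_STAGE.CLEANING_UP"
    FINALIZING = "RUN_STAGE.FINALIZING"


class RunState(NamedTuple):
    stage: str
    is_killed: bool = False
    failure_message: Optional[str] = None
    exit_code: Optional[int] = None


def transition_cleaning_up(run_state: RunState, kill_message: str) -> RunState:
    # This is the step whose log line appears above: as the run leaves
    # CLEANING_UP, the kill reason is attached as the failure message.
    if run_state.is_killed:
        return run_state._replace(
            stage=RunStage.FINALIZING,
            failure_message=kill_message,
        )
    return run_state._replace(stage=RunStage.FINALIZING)


def check_in_payload(run_state: RunState) -> dict:
    # Later, the worker checks in and sends its runs (including
    # failure_message) up to the rest server.
    return run_state._asdict()
```

If that picture is right, the failure message attached during the CLEANING_UP → FINALIZING transition should still be present in the payload the worker sends at check-in.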

Now, a bundle's terminal state is set here in the rest server. For the terminal state to be READY, the failure_message and exit_code would both have to be None. Therefore, that bundle did not have a failure_message when its status was sent up to the rest server -- even though it must have had a failure message at some point while transitioning to finalizing, since that was logged. (Note: before doing this, the failure_message and exit_code are added to the bundle metadata in the transition_bundle_finalizing function here.)
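For reference, the terminal-state decision being described boils down to something like the following hypothetical sketch (state names and the function signature are assumptions; the real logic lives in the rest server):

```python
# Hypothetical reading of the terminal-state decision: READY only when neither
# a failure_message nor an exit_code was reported; otherwise a failure state.
from typing import Optional


def terminal_state(failure_message: Optional[str], exit_code: Optional[int]) -> str:
    if failure_message is None and exit_code is None:
        return "ready"
    # Otherwise the bundle ends up in a failure state (failed/killed), which is
    # what the test expects but is not what we observe here.
    return "failed"
```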

So, what is happening here? Currently, I'm not sure. I have logged worker_run.as_dict in the PR for the fix to the tests, so the next kill failure should show us what the worker_run dict actually contains -- but I'm genuinely unsure what's going on. We could also try setting exclude_final_states to True when calling wait_until_state and hope that the bundle eventually reaches the KILLED state, but that makes me uncomfortable.
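For concreteness, the two options look roughly like this (worker_run.as_dict, wait_until_state, and exclude_final_states are the names from above; everything else is assumed scaffolding, not the real code):

```python
# Rough illustration of the two options, with assumed surrounding code.
import logging

logger = logging.getLogger(__name__)


def on_worker_checkin(worker_run) -> None:
    # Option 1 (done in the PR fixing the tests): log exactly what the worker
    # sent up, so the next kill failure shows whether failure_message made it
    # into the payload.
    logger.info("worker_run: %s", worker_run.as_dict)


# Option 2 (test-side workaround): keep waiting past other terminal states and
# hope the bundle eventually reaches KILLED, e.g. something like:
#   wait_until_state(bundle_uuid, State.KILLED, exclude_final_states=True)
# This papers over the underlying race rather than explaining it, hence the
# discomfort.
```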

@epicfaace epicfaace added the p1 Do it in the next two weeks. label May 3, 2023
@epicfaace epicfaace added p2 Do it this quarter. and removed p1 Do it in the next two weeks. labels Jun 14, 2023
@epicfaace epicfaace removed their assignment Jun 14, 2023
@epicfaace epicfaace added p1 Do it in the next two weeks. and removed p2 Do it this quarter. labels Jun 14, 2023