Kill/run
First, note that the run tests that fail do so because of failures in kill; see the examples above for why.
This one perplexes me.
So, in the worker logs for the sharedFS run that failed here, we see the following:
2023-03-22 19:09:32,441 Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.RUNNING to RUN_STAGE.CLEANING_UP. Reason: the bundle was killed /opt/codalab/worker/worker_run_state.py 142
2023-03-22 19:09:34,345 Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.CLEANING_UP to RUN_STAGE.FINALIZING. Reason: Bundle is killed. uuid: 0x468e9f41ecee4108895cec7001e9f01d. failure message: Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota /opt/codalab/worker/worker_run_state.py 142
2023-03-22 19:09:39,384 Got websocket message, got data: fv-az564-273, going to check in now. /opt/codalab/worker/worker.py 316
2023-03-22 19:09:39,689 Connected! Successful check in! /opt/codalab/worker/worker.py 505
2023-03-22 19:09:39,689 Received kill message: {'type': 'kill', 'uuid': '0x468e9f41ecee4108895cec7001e9f01d', 'kill_message': 'Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota'} /opt/codalab/worker/worker.py 525
2023-03-22 19:09:39,689 Received mark_finalized message: {'type': 'mark_finalized', 'uuid': '0x468e9f41ecee4108895cec7001e9f01d'} /opt/codalab/worker/worker.py 525
Note first that 0x468e9f41ecee4108895cec7001e9f01d is the UUID of the bundle for which the failure occurs in the linked GHA test.
From these logs, we first see: Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.RUNNING to RUN_STAGE.CLEANING_UP. Reason: the bundle was killed. This corresponds to the code here and indicates that the kill was indeed received. Next, we see: Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.CLEANING_UP to RUN_STAGE.FINALIZING. Reason: Bundle is killed. This again confirms that the bundle was killed, and the same log line carries the failure message: Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota. This corresponds to the code here. Directly after this, the run_state has the new failure message added. Later on, the worker checks in and sends those runs to the server (see here).
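To make the sequence in the logs concrete, here is a minimal sketch of how a kill could carry its message through the two transitions. It is not the actual code in worker_run_state.py: the `RunState` fields and both transition functions are hypothetical stand-ins, and only the stage names and the message text come from the logs.

```python
from typing import NamedTuple, Optional


class RunState(NamedTuple):
    # Hypothetical, simplified stand-in for the worker's run state.
    stage: str
    is_killed: bool = False
    kill_message: Optional[str] = None
    failure_message: Optional[str] = None


def transition_from_running(run_state: RunState) -> RunState:
    # Assumption: a killed bundle leaves RUNNING and starts cleaning up.
    if run_state.is_killed:
        return run_state._replace(stage="CLEANING_UP")
    return run_state


def transition_from_cleaning_up(run_state: RunState) -> RunState:
    # Assumption: the kill message becomes the failure message at this step.
    failure_message = run_state.failure_message
    if run_state.is_killed and run_state.kill_message:
        failure_message = run_state.kill_message
    return run_state._replace(stage="FINALIZING", failure_message=failure_message)


killed = RunState(
    stage="RUNNING",
    is_killed=True,
    kill_message="Kill requested: User time quota exceeded.",
)
finalizing = transition_from_cleaning_up(transition_from_running(killed))
```

If the real transitions behave like this, the run state that reaches FINALIZING should still hold the failure message at the next check-in, which is what makes the READY outcome surprising.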
Now, a bundle's terminal state is set here in the rest server. For the terminal state to be READY, both failure_message and exit_code would have to be None. Therefore, that bundle did not have a failure_message when its status was sent up to the rest-server, even though it must have had one at some point while transitioning to finalizing, since that was logged. (Note: before the terminal state is set, the failure_message and exit_code are added to the bundle metadata in the transition_bundle_finalizing function here.)
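The terminal-state rule described above can be sketched roughly as follows. This is an illustration, not the rest server's code: the function name `terminal_state` and the lowercase state strings are hypothetical, and the check for the "Kill requested" prefix is an assumption based on the failure message in the logs.

```python
from typing import Optional


def terminal_state(failure_message: Optional[str], exit_code: Optional[int]) -> str:
    # Assumption: a failure message that mentions a kill request maps to KILLED.
    if failure_message and "Kill requested" in failure_message:
        return "killed"
    # Any other failure message, or a nonzero exit code, maps to FAILED.
    if failure_message or exit_code:
        return "failed"
    # READY is only reachable when neither value is set.
    return "ready"
```

Under this reading, a bundle can only end up READY if the failure message was missing from whatever the worker sent at check-in.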
So, what is happening here? Currently, I'm not sure. In the PR that fixes the tests, I have added logging of worker_run.as_dict so that we can capture the worker_run dict on the next kill failure and see what it says, but I'm genuinely unsure what is going on. We could also try setting exclude_final_states to True when calling wait_until_state and hope that the bundle eventually reaches the KILLED state, but that makes me uncomfortable.
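For reference, here is a rough sketch of what the exclude_final_states option would change. It is a hypothetical reimplementation, not the test suite's wait_until_state: the `get_state` callback, the `FINAL_STATES` set, and the timeout parameters are assumptions made for illustration.

```python
import time
from typing import Callable

# Assumed set of final states for this sketch.
FINAL_STATES = {"ready", "failed", "killed"}


def wait_until_state(
    get_state: Callable[[], str],
    expected_state: str,
    exclude_final_states: bool = False,
    timeout_seconds: float = 10.0,
    poll_interval: float = 0.01,
) -> str:
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state == expected_state:
            return state
        # By default, reaching a different final state is an immediate failure.
        if not exclude_final_states and state in FINAL_STATES:
            raise AssertionError(f"reached {state!r}, expected {expected_state!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"timed out waiting for {expected_state!r}")
        time.sleep(poll_interval)
```

With exclude_final_states set to True, an intermediate READY would be polled past rather than reported, which is why this option would hide the problem instead of explaining it.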