-
Notifications
You must be signed in to change notification settings - Fork 511
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spot] Expose failure reason for spot jobs #1655
Conversation
(May only get to it tomorrow so tagging @infwinston too. Quick look: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll. Did not finish a complete pass. Left some design questions first.
sky/spot/recovery_strategy.py
Outdated
# The cluster name is too long. | ||
raise exceptions.ResourcesUnavailableError(str(e)) from e | ||
except exceptions.ResourcesUnavailableError as e: | ||
if len(e.failover_reasons) == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit confused by:
- Why do we not need to catch InvalidClusterNameError anymore?
- Why do we do
if len(e.failover_reasons) == 1
in several places? What if then length is 0 or >1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we not need to catch InvalidClusterNameError anymore?
The sky.launch
will only raise ResourcesUnavailableError
as the InvalidClusterNameError
, i.e. the previous catching is not effective.
Why do we do if len(e.failover_reasons) == 1 in several places? What if then length is 0 or >1?
We only use the failover_reasons
when len(e.failover_reasons) == 1
, because in that case, the underlying failovers are due to the same reason, and we can aggregate those errors. However, for other cases, failover_type can be a mix of multiple reasons, e.g. ResourcesUnvailableError
for AWS+ InvalidClusterNameError
for GCP. We may want to retry the launch if that happens for trying to get available resources. I changed it to explicitly checking the failover_type
to be one of the known errors that do not need to retry. Wdyt?
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
…evitamin/sky-experiments into fix-spot-status-for-cluster-name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll! Did a complete pass. Mostly related to comments.
sky/spot/recovery_strategy.py
Outdated
@@ -143,13 +201,15 @@ def _launch(self, max_retry=3, raise_on_failure=True) -> Optional[float]: | |||
The job's submit timestamp, or None if failed to submit the job | |||
(either provisioning fails or any error happens in job submission) | |||
and raise_on_failure is False. | |||
|
|||
Raises: | |||
exceptions.ResourceUnavailableError: If the launch fails. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exceptions.ResourceUnavailableError: If the launch fails. | |
exceptions.ResourceUnavailableError: If the launch fails after retries and `raise_on_failure` is True. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't retry for the case when the ResourcesUnavailableError
does not appear in the failover reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll - this is tricky to get right. LGTM.
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
…evitamin/sky-experiments into fix-spot-status-for-cluster-name
sky/exceptions.py
Outdated
@@ -29,6 +29,18 @@ def with_failover_history(self, failover_history: List[Exception]) -> None: | |||
return self | |||
|
|||
|
|||
class SpotJobFailBeforeProvisionError(Exception): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious what's the reason for adding this? I'm not sure if this is clearer than before. E.g., "before provision" may be ambiguous, because we may have encountered invalid name on one cloud, and resource unavailable on another. Does that count as "before provision"? The previous impl seems good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The case mentioned above should not happen when this error is raised. Here are the reasons I added this error:
- I found the previous implementation hard to distinguish the exception raised after maximum retry and the exception raised because the
ResourcesUnavailableError
has an emptyfailover_history
. I think it would be better to explicitly distinguish the error. - We will only raise the exception iff the optimizer cannot find the feasible resources, or none of the failover is because of resources unavailability. (
skypilot/sky/spot/recovery_strategy.py
Lines 233 to 263 in 1bf7993
if not any( isinstance(err, exceptions.ResourcesUnavailableError) for err in e.failover_history): # _launch() (this function) should fail/exit directly, if # none of the failover reasons were because of resource # unavailability or no failover was attempted (the optimizer # cannot find feasible resources for requested resources), # i.e., e.failover_history is empty. # Failing directly avoids the infinite loop of retrying # the launch when, e.g., an invalid cluster name is used # and --retry-until-up is specified. reason = common_utils.format_exception(e) if e.failover_history: reason = common_utils.format_exception( e.failover_history[0]) logger.error( 'Failure happened before provisioning. Failover ' f'reasons: {reason}') if raise_on_failure: if e.failover_history: # Only use the first failover error, as it would be # too verbose to show all the failover errors, and # it is likely that the first failover error is # similar to the others, e.g. the cluster name is # invalid or cloud user identity is invalid, etc. reason_err = e.failover_history[0] else: reason_err = e raise exceptions.SpotJobFailBeforeProvisionError( reason=reason_err) return None - It would be quite verbose to check whether the
failover_history
containsResourcesUnavailableError
in bothrecovery_strategy.py
andcontroller.py
in the previous implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some draft comments.
sky/exceptions.py
Outdated
|
||
|
||
class SpotJobFailedBeforeProvisionError(Exception): | ||
"""Raised when a spot job fails before provision. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add #1655 (comment)? E.g.,
...
This is only raised by the spot controller process (`recovery_strategy`) when one of the following happens:
- The optimizer cannot find feasible resources: e.g., this includes the case where a maximum number retries are attempted and the launch still failed, corresponding to the case above where the optimizer cannot find any more feasible resources.
- or none of the exceptions in failover history are because of resources unavailability returned from an actual provision request.
This exception differs from an ResourcesUnavailableError with an empty failover_history, because the latter will only happen when ....
Not sure these are correct; please check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May need updates after offline discussions.
Based on the offline discussion, I refactored the exceptions used for the spot code path. PTAL. : ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just a bunch of rewording suggestions. Thanks @Michaelvll.
* Expose failure reason for spot jobs * Add failure reason for normal failure * Failure reason hint for sky logs sky-spot-controller * require failure reason for all * Fix the conftest * fix controller name * revert SKYPILOT_USER * Show controller processs logs with sky spot logs for better UX * revert usage user ID * do not overwrite failure reason for resource unavailable * format * lint * address comments * fix comment * Update docs/source/examples/spot-jobs.rst Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> * improve readability and refactoring * address comments * format * Add comment * address comments * format * Add failover history to the error rasied by _launch * Add comment * Update sky/spot/recovery_strategy.py Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> * refactor * Address comment * Update sky/spot/recovery_strategy.py Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> * format * format * fix exception name * refactor a bit * Add more comments * format * fix * fix logs * adopt suggestions * Fix rendering --------- Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* Expose failure reason for spot jobs * Add failure reason for normal failure * Failure reason hint for sky logs sky-spot-controller * require failure reason for all * Fix the conftest * fix controller name * revert SKYPILOT_USER * Show controller processs logs with sky spot logs for better UX * revert usage user ID * do not overwrite failure reason for resource unavailable * format * lint * address comments * fix comment * Update docs/source/examples/spot-jobs.rst Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> * improve readability and refactoring * address comments * format * Add comment * address comments * format * Add failover history to the error rasied by _launch * Add comment * Update sky/spot/recovery_strategy.py Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> * refactor * Address comment * Update sky/spot/recovery_strategy.py Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> * format * format * fix exception name * refactor a bit * Add more comments * format * fix * fix logs * adopt suggestions * Fix rendering --------- Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Changes in this PR:
FAILED_CONFIG
failure type to theSpotStatus
for the spot jobs that failed due to name length and other cloud configuration problems.sky spot queue -a
Tested (run the relevant ones):
sky spot launch -n test-failure tests/test_yamls/failed_setup.yaml
Job 42 abovesky spot launch -n test-failure exit 1
Job 45 abovemv ~/.aws ~/.aws.bk
):sky spot launch -n a-cluster-name-is-longer-than-35-chars echo hi
Job 44 abovemv ~/.aws ~/.aws.bk
):sky spot launch -n test-no-resources --cloud aws --gpus A100:8 echo hi
Job 43 abovesky spot launch -n a-cluster-name-is-longer-than-35-chars echo hi
sky spot launch -n test-no-resources --cloud aws --gpus A100:8 echo hi
sky spot logs 47
sky spot logs 2
pytest tests/test_smoke.py