Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spot] Expose failure reason for spot jobs #1655

Merged
merged 39 commits into from
Feb 5, 2023

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Feb 1, 2023

Changes in this PR:

  1. Add the FAILED_CONFIG failure type to the SpotStatus for the spot jobs that failed due to name length and other cloud configuration problems.
  2. Expose failure reason in sky spot queue -a

image

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
    • sky spot launch -n test-failure tests/test_yamls/failed_setup.yaml Job 42 above
    • sky spot launch -n test-failure exit 1 Job 45 above
    • Long cluster name, successfully fail directly when only GCP is enabled (mv ~/.aws ~/.aws.bk): sky spot launch -n a-cluster-name-is-longer-than-35-chars echo hi Job 44 above
    • Failure for cloud not enabled (mv ~/.aws ~/.aws.bk): sky spot launch -n test-no-resources --cloud aws --gpus A100:8 echo hi Job 43 above
    • Long cluster name, successfully skip the GCP: sky spot launch -n a-cluster-name-is-longer-than-35-chars echo hi
      image
    • Failure for no resources: sky spot launch -n test-no-resources --cloud aws --gpus A100:8 echo hi
      image
    • sky spot logs 47
      image
    • sky spot logs 2
      image
  • All smoke tests: pytest tests/test_smoke.py
  • rendered the doc locally.

@concretevitamin
Copy link
Member

(May only get to it tomorrow so tagging @infwinston too. Quick look: FAILED_CONFIG is a bit surprising to read. Do we have alternatives?)

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll. Did not finish a complete pass. Left some design questions first.

sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Show resolved Hide resolved
sky/spot/controller.py Show resolved Hide resolved
sky/spot/spot_utils.py Outdated Show resolved Hide resolved
sky/spot/spot_state.py Outdated Show resolved Hide resolved
sky/spot/spot_utils.py Outdated Show resolved Hide resolved
sky/spot/spot_utils.py Show resolved Hide resolved
sky/spot/spot_state.py Show resolved Hide resolved
sky/spot/spot_utils.py Show resolved Hide resolved
# The cluster name is too long.
raise exceptions.ResourcesUnavailableError(str(e)) from e
except exceptions.ResourcesUnavailableError as e:
if len(e.failover_reasons) == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit confused by:

  • Why do we not need to catch InvalidClusterNameError anymore?
  • Why do we do if len(e.failover_reasons) == 1 in several places? What if then length is 0 or >1?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we not need to catch InvalidClusterNameError anymore?

The sky.launch will only raise ResourcesUnavailableError as the InvalidClusterNameError, i.e. the previous catching is not effective.

Why do we do if len(e.failover_reasons) == 1 in several places? What if then length is 0 or >1?

We only use the failover_reasons when len(e.failover_reasons) == 1, because in that case, the underlying failovers are due to the same reason, and we can aggregate those errors. However, for other cases, failover_type can be a mix of multiple reasons, e.g. ResourcesUnvailableError for AWS+ InvalidClusterNameError for GCP. We may want to retry the launch if that happens for trying to get available resources. I changed it to explicitly checking the failover_type to be one of the known errors that do not need to retry. Wdyt?

docs/source/examples/spot-jobs.rst Outdated Show resolved Hide resolved
sky/spot/spot_utils.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/exceptions.py Outdated Show resolved Hide resolved
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
sky/spot/recovery_strategy.py Show resolved Hide resolved
sky/spot/recovery_strategy.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/exceptions.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll! Did a complete pass. Mostly related to comments.

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/clouds/cloud.py Outdated Show resolved Hide resolved
sky/clouds/cloud.py Outdated Show resolved Hide resolved
sky/exceptions.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/recovery_strategy.py Show resolved Hide resolved
@@ -143,13 +201,15 @@ def _launch(self, max_retry=3, raise_on_failure=True) -> Optional[float]:
The job's submit timestamp, or None if failed to submit the job
(either provisioning fails or any error happens in job submission)
and raise_on_failure is False.

Raises:
exceptions.ResourceUnavailableError: If the launch fails.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
exceptions.ResourceUnavailableError: If the launch fails.
exceptions.ResourceUnavailableError: If the launch fails after retries and `raise_on_failure` is True.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't retry for the case when the ResourcesUnavailableError does not appear in the failover reasons.

sky/spot/recovery_strategy.py Outdated Show resolved Hide resolved
sky/spot/recovery_strategy.py Outdated Show resolved Hide resolved
sky/spot/recovery_strategy.py Outdated Show resolved Hide resolved
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll - this is tricky to get right. LGTM.

sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/recovery_strategy.py Outdated Show resolved Hide resolved
sky/spot/recovery_strategy.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/recovery_strategy.py Outdated Show resolved Hide resolved
sky/spot/recovery_strategy.py Outdated Show resolved Hide resolved
Michaelvll and others added 7 commits February 3, 2023 16:35
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
…evitamin/sky-experiments into fix-spot-status-for-cluster-name
sky/exceptions.py Outdated Show resolved Hide resolved
@@ -29,6 +29,18 @@ def with_failover_history(self, failover_history: List[Exception]) -> None:
return self


class SpotJobFailBeforeProvisionError(Exception):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what's the reason for adding this? I'm not sure if this is clearer than before. E.g., "before provision" may be ambiguous, because we may have encountered invalid name on one cloud, and resource unavailable on another. Does that count as "before provision"? The previous impl seems good enough.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case mentioned above should not happen when this error is raised. Here are the reasons I added this error:

  1. I found the previous implementation hard to distinguish the exception raised after maximum retry and the exception raised because the ResourcesUnavailableError has an empty failover_history. I think it would be better to explicitly distinguish the error.
  2. We will only raise the exception iff the optimizer cannot find the feasible resources, or none of the failover is because of resources unavailability. (
    if not any(
    isinstance(err, exceptions.ResourcesUnavailableError)
    for err in e.failover_history):
    # _launch() (this function) should fail/exit directly, if
    # none of the failover reasons were because of resource
    # unavailability or no failover was attempted (the optimizer
    # cannot find feasible resources for requested resources),
    # i.e., e.failover_history is empty.
    # Failing directly avoids the infinite loop of retrying
    # the launch when, e.g., an invalid cluster name is used
    # and --retry-until-up is specified.
    reason = common_utils.format_exception(e)
    if e.failover_history:
    reason = common_utils.format_exception(
    e.failover_history[0])
    logger.error(
    'Failure happened before provisioning. Failover '
    f'reasons: {reason}')
    if raise_on_failure:
    if e.failover_history:
    # Only use the first failover error, as it would be
    # too verbose to show all the failover errors, and
    # it is likely that the first failover error is
    # similar to the others, e.g. the cluster name is
    # invalid or cloud user identity is invalid, etc.
    reason_err = e.failover_history[0]
    else:
    reason_err = e
    raise exceptions.SpotJobFailBeforeProvisionError(
    reason=reason_err)
    return None
    )
  3. It would be quite verbose to check whether the failover_history contains ResourcesUnavailableError in both recovery_strategy.py and controller.py in the previous implementation.

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some draft comments.

sky/spot/controller.py Show resolved Hide resolved
sky/exceptions.py Outdated Show resolved Hide resolved


class SpotJobFailedBeforeProvisionError(Exception):
"""Raised when a spot job fails before provision.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add #1655 (comment)? E.g.,

...

This is only raised by the spot controller process (`recovery_strategy`) when one of the following happens:

  - The optimizer cannot find feasible resources: e.g., this includes the case where a maximum number retries are attempted and the launch still failed, corresponding to the case above where the optimizer cannot find any more feasible resources. 
  - or none of the exceptions in failover history are because of resources unavailability returned from an actual provision request.



This exception differs from an ResourcesUnavailableError with an empty failover_history, because the latter will only happen when ....

Not sure these are correct; please check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May need updates after offline discussions.

sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
@Michaelvll
Copy link
Collaborator Author

Based on the offline discussion, I refactored the exceptions used for the spot code path. PTAL. : )

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a bunch of rewording suggestions. Thanks @Michaelvll.

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/spot/spot_state.py Outdated Show resolved Hide resolved
sky/spot/spot_state.py Outdated Show resolved Hide resolved
sky/execution.py Outdated Show resolved Hide resolved
sky/spot/controller.py Outdated Show resolved Hide resolved
sky/spot/recovery_strategy.py Outdated Show resolved Hide resolved
sky/spot/recovery_strategy.py Outdated Show resolved Hide resolved
sky/exceptions.py Outdated Show resolved Hide resolved
sky/exceptions.py Outdated Show resolved Hide resolved
@Michaelvll Michaelvll merged commit c2e0ff7 into master Feb 5, 2023
@Michaelvll Michaelvll deleted the fix-spot-status-for-cluster-name branch February 5, 2023 05:37
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Feb 22, 2023
* Expose failure reason for spot jobs

* Add failure reason for normal failure

* Failure reason hint for sky logs sky-spot-controller

* require failure reason for all

* Fix the conftest

* fix controller name

* revert SKYPILOT_USER

* Show controller processs logs with sky spot logs for better UX

* revert usage user ID

* do not overwrite failure reason for resource unavailable

* format

* lint

* address comments

* fix comment

* Update docs/source/examples/spot-jobs.rst

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* improve readability and refactoring

* address comments

* format

* Add comment

* address comments

* format

* Add failover history to the error rasied by _launch

* Add comment

* Update sky/spot/recovery_strategy.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* refactor

* Address comment

* Update sky/spot/recovery_strategy.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* format

* format

* fix exception name

* refactor a bit

* Add more comments

* format

* fix

* fix logs

* adopt suggestions

* Fix rendering

---------

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Mar 15, 2023
* Expose failure reason for spot jobs

* Add failure reason for normal failure

* Failure reason hint for sky logs sky-spot-controller

* require failure reason for all

* Fix the conftest

* fix controller name

* revert SKYPILOT_USER

* Show controller processs logs with sky spot logs for better UX

* revert usage user ID

* do not overwrite failure reason for resource unavailable

* format

* lint

* address comments

* fix comment

* Update docs/source/examples/spot-jobs.rst

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* improve readability and refactoring

* address comments

* format

* Add comment

* address comments

* format

* Add failover history to the error rasied by _launch

* Add comment

* Update sky/spot/recovery_strategy.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* refactor

* Address comment

* Update sky/spot/recovery_strategy.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* format

* format

* fix exception name

* refactor a bit

* Add more comments

* format

* fix

* fix logs

* adopt suggestions

* Fix rendering

---------

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants