[Spot] Expose failure reason for spot jobs #1655

Michaelvll · 2023-02-01T09:39:12Z

Changes in this PR:

Add a FAILED_CONFIG failure type to SpotStatus for spot jobs that fail because the cluster name is too long or because of other cloud configuration problems.
Expose the failure reason in `sky spot queue -a`.
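
As a rough illustration of the two changes listed above, a status enum with a separate failure type for configuration problems could be paired with a free-form failure reason. Only the name `FAILED_CONFIG` comes from this PR description; `SpotStatusSketch`, `JobRecord`, and `format_queue_row` are hypothetical stand-ins, not SkyPilot's actual code.

```python
import enum
from dataclasses import dataclass
from typing import Optional


class SpotStatusSketch(enum.Enum):
    """Hypothetical subset of spot job statuses (not SkyPilot's real enum)."""
    RUNNING = 'RUNNING'
    SUCCEEDED = 'SUCCEEDED'
    FAILED = 'FAILED'
    # Failure caused by a configuration problem, e.g. a cluster name that is
    # too long for the target cloud.
    FAILED_CONFIG = 'FAILED_CONFIG'

    def is_failed(self) -> bool:
        return self in (SpotStatusSketch.FAILED,
                        SpotStatusSketch.FAILED_CONFIG)


@dataclass
class JobRecord:
    """Hypothetical row of the job table."""
    job_id: int
    status: SpotStatusSketch
    failure_reason: Optional[str] = None


def format_queue_row(job: JobRecord) -> str:
    """Render one queue row; the reason is shown only for failed jobs."""
    reason = job.failure_reason if job.status.is_failed() else None
    return f'{job.job_id}  {job.status.value}  {reason or "-"}'
```

Keeping the reason as a separate field, rather than encoding it in the status, lets the status set stay small while the reason carries the cloud-specific detail.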

Tested (run the relevant ones):

concretevitamin · 2023-02-01T20:22:40Z

(May only get to it tomorrow so tagging @infwinston too. Quick look: FAILED_CONFIG is a bit surprising to read. Do we have alternatives?)

concretevitamin

Thanks @Michaelvll. Did not finish a complete pass. Left some design questions first.

sky/cli.py

sky/spot/controller.py

sky/spot/spot_utils.py

sky/spot/spot_state.py

sky/spot/spot_utils.py

sky/spot/spot_state.py

sky/spot/spot_utils.py

concretevitamin · 2023-02-02T16:27:44Z

sky/spot/recovery_strategy.py

-                # The cluster name is too long.
-                raise exceptions.ResourcesUnavailableError(str(e)) from e
+            except exceptions.ResourcesUnavailableError as e:
+                if len(e.failover_reasons) == 1:


A bit confused by:

Why do we not need to catch InvalidClusterNameError anymore?

Why do we do if len(e.failover_reasons) == 1 in several places? What if the length is 0 or >1?

Why do we not need to catch InvalidClusterNameError anymore?

sky.launch only raises ResourcesUnavailableError in place of the InvalidClusterNameError, i.e., the previous catch was never effective.

Why do we do if len(e.failover_reasons) == 1 in several places? What if the length is 0 or >1?

We only use failover_reasons when len(e.failover_reasons) == 1, because in that case the underlying failovers all share the same reason and we can aggregate those errors. In other cases, failover_type can be a mix of multiple reasons, e.g., ResourcesUnavailableError for AWS + InvalidClusterNameError for GCP. If that happens, we may want to retry the launch to try to get available resources. I changed it to explicitly check whether the failover_type is one of the known errors that do not need a retry. Wdyt?
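
A minimal sketch of the check described in this reply, under stated assumptions: the two exception classes are local stand-ins named after the ones discussed here, `NON_RETRYABLE_ERRORS` and `should_retry` are hypothetical names, and treating an empty list as "do not retry" is an assumption made for the example.

```python
from typing import List, Tuple, Type


class ResourcesUnavailableError(Exception):
    """Local stand-in: a cloud had no capacity for the request."""


class InvalidClusterNameError(Exception):
    """Local stand-in: a configuration error that a retry cannot fix."""


# Hypothetical list of known errors that do not need a retry.
NON_RETRYABLE_ERRORS: Tuple[Type[Exception], ...] = (InvalidClusterNameError,)


def should_retry(failover_reasons: List[Exception]) -> bool:
    """Decide whether relaunching could plausibly succeed."""
    if not failover_reasons:
        # Assumption: with no failover attempted there is nothing to retry.
        return False
    # Retry unless every reason is a known non-retryable error. A mix such as
    # "unavailable on AWS + invalid name on GCP" therefore still retries.
    return not all(
        isinstance(reason, NON_RETRYABLE_ERRORS)
        for reason in failover_reasons)
```

Checking the error types directly avoids relying on the list length, so the mixed case and the single-reason case go through the same code path.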

docs/source/examples/spot-jobs.rst

sky/spot/spot_utils.py

sky/backends/cloud_vm_ray_backend.py

sky/exceptions.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

sky/spot/recovery_strategy.py

sky/spot/controller.py

sky/backends/cloud_vm_ray_backend.py

sky/exceptions.py

sky/spot/controller.py

…evitamin/sky-experiments into fix-spot-status-for-cluster-name

concretevitamin

Thanks @Michaelvll! Did a complete pass. Mostly related to comments.

sky/backends/cloud_vm_ray_backend.py

sky/clouds/cloud.py

sky/exceptions.py

sky/spot/controller.py

sky/spot/recovery_strategy.py

concretevitamin · 2023-02-03T16:55:53Z

sky/spot/recovery_strategy.py

@@ -143,13 +201,15 @@ def _launch(self, max_retry=3, raise_on_failure=True) -> Optional[float]:
            The job's submit timestamp, or None if failed to submit the job
            (either provisioning fails or any error happens in job submission)
            and raise_on_failure is False.
+
+        Raises:
+            exceptions.ResourceUnavailableError: If the launch fails.


Suggested change

exceptions.ResourceUnavailableError: If the launch fails.

exceptions.ResourceUnavailableError: If the launch fails after retries and `raise_on_failure` is True.

We don't retry for the case when the ResourcesUnavailableError does not appear in the failover reasons.
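
The behaviour described by the suggested docstring and this reply can be sketched as a small retry loop. This is an illustration only: `launch_with_retry`, `launch_once`, and `backoff_seconds` are hypothetical names, and the exception class is a local stand-in that carries a `failover_history` list.

```python
import time
from typing import Callable, List, Optional


class ResourcesUnavailableError(Exception):
    """Local stand-in carrying the errors seen during failover."""

    def __init__(self, message: str = '', failover_history=None):
        super().__init__(message)
        self.failover_history: List[Exception] = list(failover_history or [])


def launch_with_retry(launch_once: Callable[[], float],
                      max_retry: int = 3,
                      raise_on_failure: bool = True,
                      backoff_seconds: float = 0.0) -> Optional[float]:
    """Return the submit timestamp, or None on failure if not raising."""
    last_error: Optional[ResourcesUnavailableError] = None
    for attempt in range(max_retry):
        try:
            return launch_once()
        except ResourcesUnavailableError as e:
            last_error = e
            # Retry only if some failover was due to resource unavailability.
            retryable = any(
                isinstance(err, ResourcesUnavailableError)
                for err in e.failover_history)
            if not retryable:
                break
            if attempt < max_retry - 1:
                time.sleep(backoff_seconds)
    if raise_on_failure:
        history = last_error.failover_history if last_error else []
        raise ResourcesUnavailableError(
            'Launch failed.', failover_history=history) from last_error
    return None
```

With `raise_on_failure=False` the caller has to treat `None` as "not submitted", which is why the docstring spells out both outcomes.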

sky/spot/recovery_strategy.py

concretevitamin

Thanks @Michaelvll - this is tricky to get right. LGTM.

sky/spot/controller.py

sky/spot/recovery_strategy.py

sky/spot/controller.py

sky/spot/recovery_strategy.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

…evitamin/sky-experiments into fix-spot-status-for-cluster-name

sky/exceptions.py

concretevitamin · 2023-02-04T05:54:15Z

sky/exceptions.py

@@ -29,6 +29,18 @@ def with_failover_history(self, failover_history: List[Exception]) -> None:
        return self


+class SpotJobFailBeforeProvisionError(Exception):


Curious what's the reason for adding this? I'm not sure if this is clearer than before. E.g., "before provision" may be ambiguous, because we may have encountered invalid name on one cloud, and resource unavailable on another. Does that count as "before provision"? The previous impl seems good enough.

The case mentioned above should not happen when this error is raised. Here are the reasons I added this error:

I found that the previous implementation made it hard to distinguish the exception raised after the maximum number of retries from the exception raised because the ResourcesUnavailableError has an empty failover_history. I think it would be better to distinguish these errors explicitly.

We raise the exception iff the optimizer cannot find feasible resources, or none of the failovers is due to resource unavailability:

skypilot/sky/spot/recovery_strategy.py

Lines 233 to 263 in 1bf7993

    if not any(
            isinstance(err, exceptions.ResourcesUnavailableError)
            for err in e.failover_history):
        # _launch() (this function) should fail/exit directly, if
        # none of the failover reasons were because of resource
        # unavailability or no failover was attempted (the optimizer
        # cannot find feasible resources for requested resources),
        # i.e., e.failover_history is empty.
        # Failing directly avoids the infinite loop of retrying
        # the launch when, e.g., an invalid cluster name is used
        # and --retry-until-up is specified.
        reason = common_utils.format_exception(e)
        if e.failover_history:
            reason = common_utils.format_exception(
                e.failover_history[0])
        logger.error(
            'Failure happened before provisioning. Failover '
            f'reasons: {reason}')
        if raise_on_failure:
            if e.failover_history:
                # Only use the first failover error, as it would be
                # too verbose to show all the failover errors, and
                # it is likely that the first failover error is
                # similar to the others, e.g. the cluster name is
                # invalid or cloud user identity is invalid, etc.
                reason_err = e.failover_history[0]
            else:
                reason_err = e
            raise exceptions.SpotJobFailBeforeProvisionError(
                reason=reason_err)
        return None

In the previous implementation, it would be quite verbose to check whether the failover_history contains a ResourcesUnavailableError in both recovery_strategy.py and controller.py.
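
To make the last point concrete, here is a hedged sketch of how a dedicated exception lets the caller branch on the exception type instead of re-inspecting failover_history. The class name carries a `Sketch` suffix because it is a stand-in for the exception added in this PR, and `describe_failure` is a hypothetical helper, not the controller's real code.

```python
from typing import Tuple


class SpotJobFailBeforeProvisionErrorSketch(Exception):
    """Stand-in: the job failed before any provisioning could succeed."""

    def __init__(self, reason: Exception):
        super().__init__(str(reason))
        # Keep the underlying error so the caller can report it to the user.
        self.reason = reason


def describe_failure(err: Exception) -> Tuple[str, str]:
    """Map an error from the launch path to (category, reason text)."""
    if isinstance(err, SpotJobFailBeforeProvisionErrorSketch):
        # The raiser already decided that a retry is pointless, so no
        # failover history needs to be examined here.
        inner = err.reason
        return ('failed_before_provision',
                f'{type(inner).__name__}: {inner}')
    return ('failed_after_retries', f'{type(err).__name__}: {err}')
```

The decision about retryability is then made once, where the failover history is available, and the caller only needs a type check.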

concretevitamin

Some draft comments.

sky/spot/controller.py

sky/exceptions.py

concretevitamin · 2023-02-04T16:56:32Z

sky/exceptions.py

+
+
+class SpotJobFailedBeforeProvisionError(Exception):
+    """Raised when a spot job fails before provision.


Add #1655 (comment)? E.g.,

... This is only raised by the spot controller process (`recovery_strategy`) when one of the following happens:
- The optimizer cannot find feasible resources; e.g., this includes the case where the maximum number of retries has been attempted and the launch still fails, corresponding to the case above where the optimizer cannot find any more feasible resources.
- None of the exceptions in the failover history are due to resource unavailability returned from an actual provision request.

This exception differs from a ResourcesUnavailableError with an empty failover_history, because the latter will only happen when ....

Not sure these are correct; please check.

May need updates after offline discussions.

sky/spot/controller.py

Michaelvll · 2023-02-04T23:49:48Z

Based on the offline discussion, I refactored the exceptions used for the spot code path. PTAL. : )

concretevitamin

LGTM! Just a bunch of rewording suggestions. Thanks @Michaelvll.

sky/backends/cloud_vm_ray_backend.py

sky/spot/spot_state.py

sky/execution.py

sky/spot/controller.py

sky/spot/recovery_strategy.py

sky/exceptions.py

* Expose failure reason for spot jobs
* Add failure reason for normal failure
* Failure reason hint for sky logs sky-spot-controller
* require failure reason for all
* Fix the conftest
* fix controller name
* revert SKYPILOT_USER
* Show controller processs logs with sky spot logs for better UX
* revert usage user ID
* do not overwrite failure reason for resource unavailable
* format
* lint
* address comments
* fix comment
* Update docs/source/examples/spot-jobs.rst Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* improve readability and refactoring
* address comments
* format
* Add comment
* address comments
* format
* Add failover history to the error rasied by _launch
* Add comment
* Update sky/spot/recovery_strategy.py Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* refactor
* Address comment
* Update sky/spot/recovery_strategy.py Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* format
* format
* fix exception name
* refactor a bit
* Add more comments
* format
* fix
* fix logs
* adopt suggestions
* Fix rendering

---------

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

Michaelvll added 3 commits February 1, 2023 01:13

Expose failure reason for spot jobs

4ae4a2a

Add failure reason for normal failure

f88a408

Failure reason hint for sky logs sky-spot-controller

3786936

concretevitamin self-requested a review February 1, 2023 17:46

Michaelvll added 4 commits February 1, 2023 11:20

require failure reason for all

fbf720d

Fix the conftest

bd75460

fix controller name

585b268

revert SKYPILOT_USER

b90bf51

concretevitamin requested a review from infwinston February 1, 2023 20:22

Michaelvll added 5 commits February 1, 2023 14:05

Show controller processs logs with sky spot logs for better UX

8aa176b

revert usage user ID

7ab8355

do not overwrite failure reason for resource unavailable

c6c0f45

format

5318c43

lint

a609901

concretevitamin reviewed Feb 2, 2023


Michaelvll added 2 commits February 2, 2023 11:59

address comments

bda6c72

fix comment

4516596

concretevitamin reviewed Feb 3, 2023


Update docs/source/examples/spot-jobs.rst

fcbabd2

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

concretevitamin reviewed Feb 3, 2023


Michaelvll added 5 commits February 3, 2023 00:08

improve readability and refactoring

d034b0b

Merge branch 'fix-spot-status-for-cluster-name' of github.com:concret…

cb501a7

…evitamin/sky-experiments into fix-spot-status-for-cluster-name

address comments

555f0ec

format

5c2d005

Add comment

903deb9

concretevitamin reviewed Feb 3, 2023


Michaelvll added 3 commits February 3, 2023 13:37

address comments

11ef41c

format

0ee6703

Add failover history to the error rasied by _launch

78696c6

Add comment

b054f10

concretevitamin approved these changes Feb 3, 2023


Michaelvll and others added 7 commits February 3, 2023 16:35

Update sky/spot/recovery_strategy.py

cb773ab

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

refactor

087c31b

Address comment

5d6723b

Update sky/spot/recovery_strategy.py

d3726ed

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

format

f524a41

Merge branch 'fix-spot-status-for-cluster-name' of github.com:concret…

1deab6f

…evitamin/sky-experiments into fix-spot-status-for-cluster-name

format

1bf7993

concretevitamin reviewed Feb 4, 2023


fix exception name

5b43ae9

concretevitamin reviewed Feb 4, 2023


Michaelvll added 4 commits February 4, 2023 15:42

refactor a bit

c409b66

Add more comments

629b2d4

format

ccf8469

fix

14b1b74

concretevitamin approved these changes Feb 5, 2023


Michaelvll added 3 commits February 4, 2023 17:49

fix logs

f2ac7e9

adopt suggestions

6505fd2

Fix rendering

47be483

Michaelvll merged commit c2e0ff7 into master Feb 5, 2023

Michaelvll deleted the fix-spot-status-for-cluster-name branch February 5, 2023 05:37

Michaelvll mentioned this pull request Feb 23, 2023

[Spot] Fix spot failure reason when cloud is specified #1714

Merged

1 task
