Skip to content

Conversation

cg505
Copy link
Collaborator

@cg505 cg505 commented Sep 26, 2025

task.file_mounts_mapping is set when uploading a task to a remote API server to help the server to use the right files in the uploaded zip. This changes the translation code to remove this internal field from the translated task. Normally, this wouldn't matter, but in jobs the task may be sent in a future API request, which can cause issues.

If you call jobs.launch on a remote API server, the file mounts will be translated using file_mounts_mapping as with any other task submitted to a remote API server. This is expected.
However, later, we submit the same translated task YAML to the jobs controller, which will submit it back to an API server (either a local API server on the jobs controller in non-consolidation mode, or back to the same API server in consolidation mode). However, both of these calls will be to a "local" API server (local to the job controller process), so we will not do any file zipping/uploading/file mount translation. However, the old file_mount_mapping from the first translation will persist, even though it is now irrelevant.

The presence of the field will cause the API server to try translate the file mounts for this second API call, even though they should not be translated.

if not file_mounts_mapping:
# We did not mount any files to new paths on the remote server
# so no need to resolve filepaths.
continue
This will cause a a KeyError in process_mounts_in_task_on_api_server when the job controller is calling sdk.launch - here src/original_path will not exist in the file_mounts_mapping
file_mounts[dst] = _get_client_file_mounts_path(
src, file_mounts_mapping)
def _get_client_file_mounts_path(
original_path: str, file_mounts_mapping: Dict[str, str]) -> str:
return str(client_file_mounts_dir /
file_mounts_mapping[original_path].lstrip('/'))

Before #7051, this was not an issue because the job controller called core.launch directly instead of calling sdk.launch.

This is not an issue if you're not using a remote API server, because we won't upload files and write the file_mounts_mapping for a local API server.

This is not an issue if an intermediate bucket is used for the jobs controller. That will be used by default unless one of the following is true:

  • Consolidation mode is enabled
  • No cloud storage is available
  • jobs.force_disable_cloud_bucket is set to true in the config

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@cg505 cg505 marked this pull request as draft September 26, 2025 01:00
@cg505
Copy link
Collaborator Author

cg505 commented Sep 26, 2025

/smoke-test --remote-server

@cg505 cg505 changed the title ????? [server] remove file_mounts_mapping after use Sep 26, 2025
@cg505 cg505 marked this pull request as ready for review September 26, 2025 02:06
@SeungjinYang
Copy link
Collaborator

I'm slightly concerned about the failing test_ignore_exclusions test since it does relate to file mounts - could you take a look?

@cg505
Copy link
Collaborator Author

cg505 commented Sep 26, 2025

/smoke-test --managed-jobs

@cg505
Copy link
Collaborator Author

cg505 commented Sep 26, 2025

Current hypothesis on the test failure:

  1. --remote-server tests that have LOW_CONTROLLER_RESOURCE_ENV are not actually using a remote server. The overridden config file does not have the API server, so it just uses a local API server.
  2. The launched remote jobs controller still gets the local config file with api_server.endpoint set, so sdk.launch on the jobs controller tries to use the remote API server. (?)

@cg505
Copy link
Collaborator Author

cg505 commented Sep 26, 2025

/smoke-test -k test_ignore_exclusions

@cg505
Copy link
Collaborator Author

cg505 commented Sep 27, 2025

/smoke-test -k test_ignore_exclusions --remote-server

@cg505
Copy link
Collaborator Author

cg505 commented Sep 27, 2025

Smoke tests pass after merging master, which makes me think that #7379 may have resolved that failure.
There is still the issue with the test locally, which seems unrelated.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?????

Thanks @cg505! Can we add unit test for the task YAML got by API server / jobs controller and further the API server submitted by jobs controller?

@cg505
Copy link
Collaborator Author

cg505 commented Sep 27, 2025

Can we add unit test for the task YAML got by API server / jobs controller and further the API server submitted by jobs controller?

Regarding smoke tests, this case should already be covered by using /smoke-test --remote-server --jobs-consolidation, but this is not currently supported. I will add a smoke test for force_disable_cloud_bucket.

Regarding unit tests, we can't easily test the integration flow you mentioned. But we can add some test on this function directly.

@kevinmingtarja
Copy link
Collaborator

Current hypothesis on the test failure:

  1. --remote-server tests that have LOW_CONTROLLER_RESOURCE_ENV are not actually using a remote server. The overridden config file does not have the API server, so it just uses a local API server.

I found another (might be the same underlying issue) issue with --remote-server not actually using a remote server while doing #7413. I had to add this commit d6b068e for it to work.

@cg505
Copy link
Collaborator Author

cg505 commented Sep 30, 2025

/smoke-test -k test_managed_jobs_force_disable_cloud_bucket
/smoke-test -k test_managed_jobs_force_disable_cloud_bucket --remote-server

Copy link
Collaborator

@SeungjinYang SeungjinYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the tests @cg505!

@cg505 cg505 enabled auto-merge (squash) September 30, 2025 18:19
@cg505 cg505 merged commit 4712341 into skypilot-org:master Sep 30, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants