Skip to content

Conversation

@andrew-anyscale
Copy link
Contributor

@andrew-anyscale andrew-anyscale commented Jan 30, 2026

For the Wanda version of building each Ray Image, we want to selectively upload to DockerHub only in certain postsubmit cases. Rather than calling on 'buildkite:copy_files --destination docker_login' separately, this refactor pushes the auth setup to a central library that can be called on certain conditions.

Without this, logic would need to be pushed into the Rayci step itself, since current login is gated on a per-Pipeline-ID basis.

Topic: auth-refactor
Relative: upload-guards
Signed-off-by: andrew andrew@anyscale.com

@andrew-anyscale
Copy link
Contributor Author

Reviews in this chain:
#60612 [ci] add upload guards and multi-platform support to push_ray_image
 └#60614 [ci] refactor auth out of copy_files
  └#60455 [ci]: port Buildkite Ray Images to Wanda
   ├#59937 [ci] Add wanda anyscale image builds for release tests
   └#60539 Consolidate RAY_VERSION to use rayci.env as single source of truth

@andrew-anyscale
Copy link
Contributor Author

andrew-anyscale commented Jan 30, 2026

# head base diff date summary
0 68188ccb 2195731c diff Jan 30 10:19 AM 6 files changed, 103 insertions(+), 48 deletions(-)
1 7452f4cf 2195731c diff Jan 30 10:46 AM 1 file changed, 23 insertions(+), 8 deletions(-)
2 54061a8c e8e58b5f rebase Jan 30 10:58 AM 0 files changed
3 9fc08d89 26fe1233 diff Jan 30 10:59 AM 0 files changed

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors authentication logic into a new centralized library, ci/ray_ci/rayci_auth.py, which is a great improvement for code organization and reusability. The new implementation for Docker Hub login also enhances security by using --password-stdin, preventing password exposure in the process list. My review focuses on the new rayci_auth.py file, offering suggestions to improve its robustness, maintainability, and error handling.

Comment on lines 21 to 45
def _retry(f):
"""Retry decorator for API calls with exponential backoff on 5xx errors."""

def inner():
resp = None
for _ in range(5):
resp = f()
print("Getting Presigned URL, status_code", resp.status_code)
if resp.status_code >= 500:
print("errored, retrying...")
print(resp.text)
time.sleep(5)
else:
return resp
if resp is None or resp.status_code >= 500:
print("still errored after many retries")
sys.exit(1)

return inner
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The _retry decorator can be improved. As this logic is now in a shared library, making it more robust is important.

  • Docstring Mismatch: The docstring claims "exponential backoff", but the implementation uses a fixed time.sleep(5).
  • Exiting the process: The function calls sys.exit(1) on failure. In a library module, it's better to raise an exception to allow the caller to handle the error. This makes the function more reusable.
Suggested change
def _retry(f):
"""Retry decorator for API calls with exponential backoff on 5xx errors."""
def inner():
resp = None
for _ in range(5):
resp = f()
print("Getting Presigned URL, status_code", resp.status_code)
if resp.status_code >= 500:
print("errored, retrying...")
print(resp.text)
time.sleep(5)
else:
return resp
if resp is None or resp.status_code >= 500:
print("still errored after many retries")
sys.exit(1)
return inner
def _retry(f):
"""Retry decorator for API calls with a fixed-interval retry on 5xx errors."""
def inner():
resp = None
for _ in range(5):
resp = f()
print("Getting Presigned URL, status_code", resp.status_code)
if resp.status_code >= 500:
print("errored, retrying...")
print(resp.text)
time.sleep(5)
else:
return resp
if resp is None or resp.status_code >= 500:
print("still errored after many retries")
raise RuntimeError("Failed to get a valid response from RayCI API after multiple retries.")
return inner

Comment on lines 55 to 56
aws_region="us-west-2",
aws_service="execute-api",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The AWS region and service are hardcoded here. To improve maintainability and configurability, consider defining them as module-level constants, similar to RAYCI_API_HOST. You would then define RAYCI_API_REGION and RAYCI_API_SERVICE at the top of the file.

Suggested change
aws_region="us-west-2",
aws_service="execute-api",
aws_region=RAYCI_API_REGION,
aws_service=RAYCI_API_SERVICE,

Comment on lines 58 to 72
resp = requests.get(
RAYCI_API_URL,
auth=auth,
params={"job_id": os.environ["BUILDKITE_JOB_ID"]},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Direct access to os.environ["BUILDKITE_JOB_ID"] will raise a KeyError if the environment variable is not set. It's better practice to use .get() and raise a more informative ValueError if it's missing.

Suggested change
resp = requests.get(
RAYCI_API_URL,
auth=auth,
params={"job_id": os.environ["BUILDKITE_JOB_ID"]},
)
job_id = os.environ.get("BUILDKITE_JOB_ID")
if not job_id:
raise ValueError("BUILDKITE_JOB_ID environment variable must be set.")
resp = requests.get(
RAYCI_API_URL,
auth=auth,
params={"job_id": job_id},
)

Fetches credentials and runs `docker login` for the rayproject account.
"""
resp = get_rayci_api_response()
pwd = resp.json()["docker_password"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This line can raise a KeyError or requests.exceptions.JSONDecodeError if the API response is not as expected. It's safer to wrap this in a try...except block to handle potential errors gracefully and provide a more informative error message.

    try:
        pwd = resp.json()["docker_password"]
    except (requests.exceptions.JSONDecodeError, KeyError):
        raise ValueError(f"Could not get docker password from API. Response: {resp.text}")

@andrew-anyscale andrew-anyscale force-pushed the andrew/revup/master/auth-refactor branch from 68188cc to 7452f4c Compare January 30, 2026 18:46
For the Wanda version of building each Ray Image, we want to selectively upload to DockerHub only in certain postsubmit cases. Rather than calling on 'buildkite:copy_files --destination docker_login' separately, this refactor pushes the auth setup to a central library that can be called on certain conditions.

Without this, logic would need to be pushed into the Rayci step itself, since current login is gated on a per-Pipeline-ID basis.

Topic: auth-refactor
Relative: upload-guards
Signed-off-by: andrew <andrew@anyscale.com>
@andrew-anyscale andrew-anyscale force-pushed the andrew/revup/master/upload-guards branch from 2195731 to e8e58b5 Compare January 30, 2026 18:58
@andrew-anyscale andrew-anyscale force-pushed the andrew/revup/master/auth-refactor branch 2 times, most recently from 54061a8 to 9fc08d8 Compare January 30, 2026 18:59
@andrew-anyscale andrew-anyscale force-pushed the andrew/revup/master/upload-guards branch from e8e58b5 to 26fe123 Compare January 30, 2026 18:59
@andrew-anyscale andrew-anyscale marked this pull request as ready for review January 30, 2026 18:59
@andrew-anyscale andrew-anyscale requested a review from a team as a code owner January 30, 2026 18:59
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

subprocess.check_call(
["docker", "login", "--username", DOCKER_HUB_USERNAME, "--password-stdin"],
input=pwd.encode(),
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subprocess.check_call does not support input parameter

High Severity

The docker_hub_login function uses subprocess.check_call() with an input parameter, but check_call doesn't support input. This parameter is only available in subprocess.run(). When called, this will raise a TypeError because the input kwarg gets passed to Popen, which doesn't accept it. The function needs to use subprocess.run(..., check=True) instead.

Fix in Cursor Fix in Web

@ray-gardener ray-gardener bot added the devprod label Jan 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants