
Conversation

@ikchifo
Contributor

@ikchifo ikchifo commented Jan 18, 2026

Why are these changes needed?

Adds Azure Blob Storage support for the History Server, enabling Azure users to persist Ray cluster logs and events. Supports both connection string authentication (for dev/test) and Microsoft Entra Workload Identity (for production AKS deployments).

Configuration

Set --runtime-class-name=azureblob to enable this module.

Configuration can be provided via environment variables or via the JSON file /var/collector-config/data:

{"azureContainer": "", "azureConnectionString": "", "azureAccountURL": ""}
Variable                           Description
AZURE_STORAGE_CONNECTION_STRING    Connection string (for dev/test)
AZURE_STORAGE_ACCOUNT_URL          Account URL (for workload identity)
AZURE_STORAGE_CONTAINER            Container name (default: ray-historyserver)
AZURE_STORAGE_AUTH_MODE            Auth mode: connection_string, workload_identity, or default
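
For illustration, a minimal workload-identity setup could look like this (the account name and values are hypothetical, not taken from the PR):

# Environment-variable form (illustrative values)
export AZURE_STORAGE_ACCOUNT_URL="https://mystorageaccount.blob.core.windows.net"
export AZURE_STORAGE_CONTAINER="ray-historyserver"
export AZURE_STORAGE_AUTH_MODE="workload_identity"

# Equivalent /var/collector-config/data JSON
{"azureContainer": "ray-historyserver", "azureConnectionString": "", "azureAccountURL": "https://mystorageaccount.blob.core.windows.net"}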

Related issue number

Closes #4398

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

End-to-End Testing on AKS with Workload Identity

This feature was tested on Azure Kubernetes Service (AKS) using Microsoft Entra Workload Identity for secure, credential-free authentication to Azure Blob Storage.

1. Infrastructure Setup

Environment Variables

export RESOURCE_GROUP="rg-kuberay-wi-test"
export LOCATION="eastus"
export CLUSTER_NAME="aks-kuberay-wi-test"
export IDENTITY_NAME="id-kuberay-historyserver"
export STORAGE_ACCOUNT_NAME="stkuberayhistory$(date +%s | tail -c 6)"
export CONTAINER_NAME="ray-historyserver"
export SERVICE_ACCOUNT_NAME="ray-collector"
export SERVICE_ACCOUNT_NAMESPACE="default"
export ACR_NAME="kuberaycollector"

# AKS Cluster with Workload Identity
az group create --name "${RESOURCE_GROUP}" --location "${LOCATION}"

az aks create \
  --resource-group "${RESOURCE_GROUP}" \
  --name "${CLUSTER_NAME}" \
  --node-count 2 \
  --enable-oidc-issuer \
  --enable-workload-identity \
  --generate-ssh-keys

az aks get-credentials --resource-group "${RESOURCE_GROUP}" --name "${CLUSTER_NAME}"

# User-Assigned Managed Identity

az identity create \
  --name "${IDENTITY_NAME}" \
  --resource-group "${RESOURCE_GROUP}" \
  --location "${LOCATION}"

export IDENTITY_CLIENT_ID=$(az identity show --name "${IDENTITY_NAME}" \
  --resource-group "${RESOURCE_GROUP}" --query "clientId" -o tsv)
export IDENTITY_PRINCIPAL_ID=$(az identity show --name "${IDENTITY_NAME}" \
  --resource-group "${RESOURCE_GROUP}" --query "principalId" -o tsv)

# Storage Account with RBAC

az storage account create \
  --name "${STORAGE_ACCOUNT_NAME}" \
  --resource-group "${RESOURCE_GROUP}" \
  --location "${LOCATION}" \
  --sku Standard_LRS

export STORAGE_ACCOUNT_ID=$(az storage account show \
  --name "${STORAGE_ACCOUNT_NAME}" \
  --resource-group "${RESOURCE_GROUP}" \
  --query "id" -o tsv)

az role assignment create \
  --assignee-object-id "${IDENTITY_PRINCIPAL_ID}" \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope "${STORAGE_ACCOUNT_ID}"

# Federated Identity Credential

export AKS_OIDC_ISSUER=$(az aks show \
  --resource-group "${RESOURCE_GROUP}" \
  --name "${CLUSTER_NAME}" \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

az identity federated-credential create \
  --name "fc-ray-collector" \
  --identity-name "${IDENTITY_NAME}" \
  --resource-group "${RESOURCE_GROUP}" \
  --issuer "${AKS_OIDC_ISSUER}" \
  --subject "system:serviceaccount:${SERVICE_ACCOUNT_NAMESPACE}:${SERVICE_ACCOUNT_NAME}" \
  --audience "api://AzureADTokenExchange"

# Create ACR and attach to AKS
az acr create --resource-group "${RESOURCE_GROUP}" --name "${ACR_NAME}" --sku Basic
az aks update --resource-group "${RESOURCE_GROUP}" --name "${CLUSTER_NAME}" --attach-acr "${ACR_NAME}"

# Build collector image
cd historyserver
az acr build --registry "${ACR_NAME}" --image collector:v0.1.0 . -f Dockerfile.collector

# Install KubeRay Operator

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1

# Create ServiceAccount with Workload Identity

kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ray-collector
  namespace: default
  annotations:
    azure.workload.identity/client-id: "${IDENTITY_CLIENT_ID}"
EOF

2. Set up a RayCluster with an Azure Storage-backed Collector Sidecar

kubectl apply -f - <<EOF
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-wi-test
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
    template:
      metadata:
        labels:
          azure.workload.identity/use: "true"
      spec:
        serviceAccountName: ray-collector
        containers:
        - name: ray-head
          image: rayproject/ray:2.50.1
          env:
          - name: RAY_enable_ray_event
            value: "true"
          - name: RAY_enable_core_worker_ray_event_to_aggregator
            value: "true"
          - name: RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR
            value: "http://localhost:8084/v1/events"
          - name: RAY_DASHBOARD_AGGREGATOR_AGENT_EXPOSABLE_EVENT_TYPES
            value: "TASK_DEFINITION_EVENT,TASK_LIFECYCLE_EVENT,ACTOR_TASK_DEFINITION_EVENT,TASK_PROFILE_EVENT,DRIVER_JOB_DEFINITION_EVENT,DRIVER_JOB_LIFECYCLE_EVENT,ACTOR_DEFINITION_EVENT,ACTOR_LIFECYCLE_EVENT,NODE_DEFINITION_EVENT,NODE_LIFECYCLE_EVENT"
          resources:
            limits:
              cpu: "2"
              memory: 4Gi
            requests:
              cpu: "500m"
              memory: 1Gi
          lifecycle:
            postStart:
              exec:
                command:
                - /bin/sh
                - -c
                - |
                  while true; do
                    nodeid=\$(ps -ef | grep 'raylet.*--node_id' | grep -v grep | sed -n 's/.*--node_id=\([^ ]*\).*/\1/p' | head -1)
                    if [ -n "\$nodeid" ]; then
                      echo "\$nodeid" > /tmp/ray/raylet_node_id
                      break
                    fi
                    sleep 1
                  done
          volumeMounts:
          - name: ray-logs
            mountPath: /tmp/ray
        - name: collector
          image: kuberaycollector.azurecr.io/collector:v0.1.0
          env:
          - name: AZURE_STORAGE_ACCOUNT_URL
            value: "https://${STORAGE_ACCOUNT_NAME}.blob.core.windows.net"
          - name: AZURE_STORAGE_CONTAINER
            value: "ray-historyserver"
          command:
          - collector
          - --role=Head
          - --runtime-class-name=azureblob
          - --ray-cluster-name=raycluster-wi-test
          - --events-port=8084
          volumeMounts:
          - name: ray-logs
            mountPath: /tmp/ray
        volumes:
        - name: ray-logs
          emptyDir: {}
EOF

# Wait for pod to be ready
kubectl wait --for=condition=ready pod -l ray.io/cluster=raycluster-wi-test --timeout=120s

# Submit a Ray job to generate activity
kubectl exec -it $(kubectl get pods -l ray.io/cluster=raycluster-wi-test -o jsonpath='{.items[0].metadata.name}') \
  -c ray-head -- python -c "import ray; ray.init(); print(ray.cluster_resources())"

# Delete the cluster to trigger log upload
kubectl delete raycluster raycluster-wi-test

# Check Azure Blob Storage for uploaded files
az storage blob list \
  --account-name "${STORAGE_ACCOUNT_NAME}" \
  --container-name ray-historyserver \
  --auth-mode login \
  --output table

# Cleanup
az group delete --name "${RESOURCE_GROUP}" --yes --no-wait

3. Screenshots

Azure Storage Browser (with the uploaded session files from the Collector)

_ray_history_azure_blob_ss

Logs when new events were detected

_ray_history_new_events

Demo Video

azure_blob_recording_compressed.mp4

Note: E2E tests using Azurite will be added in a follow-up PR to keep this one focused on the core implementation.

Add Azure Blob Storage as a storage backend for the History Server,
implementing the StorageReader and StorageWriter interfaces.

This completes coverage for major cloud providers:
- AWS S3 (existing)
- Aliyun OSS (existing)
- GCS (in progress by Google team)
- Azure Blob Storage (this PR)

Features:
- Connection string authentication
- Auto-creation of container if not exists
- Full implementation of StorageReader/StorageWriter interfaces
- Unit tests using Azurite emulator
- Documentation with setup instructions

Related issue: ray-project#4398
…support

Add support for Microsoft Entra Workload Identity as an alternative to
connection string authentication for Azure Blob Storage. This enables
passwordless authentication for AKS-hosted collectors.

Changes:
- Add AuthMode configuration with connection_string, workload_identity,
  and default options
- Add AZURE_STORAGE_ACCOUNT_URL environment variable for token-based auth
- Add AZURE_STORAGE_AUTH_MODE for explicit auth mode selection
- Auto-detect authentication method when not explicitly specified
- Add azidentity dependency for DefaultAzureCredential support
- Update README with workload identity setup and troubleshooting guide
- Add missing LogFiles channel to match S3/AliyunOSS struct
- Simplify README to match project style (257 -> 50 lines)
- Add JSON config documentation for /var/collector-config/data
- Consolidate duplicate config code with populateFromEnvAndJSON()
- Remove unnecessary panic recovery blocks
- Simplify auth mode detection logic
Fix two issues causing "<no name>" display in Azure Storage Explorer:

1. Fix leading slash in metadir path construction (collector.go)
   - Changed `path.Clean(r.RootDir + "/" + "metadir")` to
     `path.Join(r.RootDir, "metadir")`
   - When RootDir is empty, the old code produced "/metadir" paths
     with an empty first segment that Azure displayed as "<no name>"

2. Make CreateDirectory a no-op for Azure Blob Storage (azureblob.go)
   - Azure Blob Storage doesn't require explicit directory markers
   - Virtual directories are automatically inferred from blob paths
   - The zero-byte marker blobs ending with "/" caused "<no name>"
     display issues in Azure Storage Explorer
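
A minimal Go sketch of the path behavior behind fix 1 (standalone and illustrative only):

package main

import (
	"fmt"
	"path"
)

func main() {
	rootDir := "" // RootDir may be empty

	// Old construction: with an empty RootDir this yields "/metadir", whose
	// empty first segment Azure Storage Explorer renders as "<no name>".
	fmt.Println(path.Clean(rootDir + "/" + "metadir")) // "/metadir"

	// Fixed construction: path.Join drops the empty segment.
	fmt.Println(path.Join(rootDir, "metadir")) // "metadir"
}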
@ikchifo ikchifo force-pushed the feature/azure-blob-storage branch 2 times, most recently from 3a15457 to c3ab6d1 Compare January 19, 2026 22:50
…uster

Enhance the documentation for setting up the collector with a Ray cluster by adding important configuration details. This includes instructions for sharing the `/tmp/ray` directory using an `emptyDir` volume and implementing a `postStart` lifecycle hook to write the raylet node ID to a file. These changes aim to prevent common errors related to session logs and node identification.
@ikchifo ikchifo force-pushed the feature/azure-blob-storage branch from c3ab6d1 to deb63c7 Compare January 19, 2026 22:52
Align Azure Blob Storage unit tests with the existing S3 and AliyunOSS
test patterns. The tests now focus on basic path manipulation rather
than full integration tests requiring Azurite.

E2E tests with Azurite will be added in a follow-up.
@ikchifo ikchifo marked this pull request as ready for review January 19, 2026 23:15

The Endpoint field was populated but never used. For Azure Blob Storage:
- Connection strings embed the endpoint (BlobEndpoint=...)
- AccountURL serves as the endpoint for token authentication

This removes the dead code to avoid user confusion.

Close resp.Body when DownloadStream fails to prevent connection leaks.
This matches the cleanup pattern used in the S3 implementation.

Replace UploadBuffer with UploadStream to avoid loading entire files
into memory. This matches the streaming approach used in S3/AliyunOSS
implementations and prevents memory exhaustion with large log files.
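
Roughly, the streaming upload pattern looks like this (a sketch assuming the azblob SDK's *azblob.Client; the variable names and the uploadTimeout constant are illustrative):

// Sketch: stream a local log file to Azure Blob Storage without
// buffering the whole file in memory.
f, err := os.Open(localPath)
if err != nil {
	return err
}
defer f.Close()

ctx, cancel := context.WithTimeout(context.Background(), uploadTimeout)
defer cancel()

// UploadStream consumes the io.Reader in chunks, unlike UploadBuffer,
// which requires the entire payload as a []byte.
_, err = client.UploadStream(ctx, containerName, blobName, f, nil)
return err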
Create a fresh context for the retry download attempt in GetContent.
The original context may have expired or have minimal time remaining
after the listing operation completes.
- Add panic recovery to ListFiles and List methods to match S3/AliyunOSS
- Use safe type assertions in config.go to prevent panics on invalid JSON
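
The safe-assertion pattern amounts to something like the following (the key name matches the JSON config above; the struct field is illustrative):

// Safe type assertion: a malformed value (e.g. a number where a string
// is expected) is simply ignored instead of causing a panic.
if v, ok := raw["azureContainer"].(string); ok {
	cfg.Container = v
}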
…types

- Remove unused HttpClient field and net/http import
- Use azcore.ResponseError for error type checking instead of string
  matching when detecting ContainerAlreadyExists error
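
For reference, the typed error check is roughly the following (a sketch assuming the azcore and bloberror packages from the Azure SDK for Go):

var respErr *azcore.ResponseError
if errors.As(err, &respErr) && respErr.ErrorCode == string(bloberror.ContainerAlreadyExists) {
	// The container already exists; treat creation as a no-op.
	return nil
}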
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Move io.ReadAll inside the retry block to ensure the stream is fully
read before the context is cancelled. The Azure SDK streams data from
the network, so cancelling the context before reading can cause
incomplete data or errors.
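
The intended ordering is roughly as follows (a sketch; the retry handling and names are illustrative):

ctx, cancel := context.WithTimeout(context.Background(), downloadTimeout)
resp, err := client.DownloadStream(ctx, containerName, blobName, nil)
if err != nil {
	cancel()
	return nil, err
}

// Read the body fully while ctx is still alive: the SDK streams the blob
// over the network, so cancelling before the read can truncate the data.
data, readErr := io.ReadAll(resp.Body)
resp.Body.Close()
cancel()
return data, readErr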
…ation

- Fix GetContent fallback to match full path instead of basename to
  avoid returning wrong content from nested directories with duplicate
  filenames. Also use delimiter to list only direct children.
- Add configurable timeout constants: uploadTimeout (5m), listTimeout
  (2m), downloadTimeout (10m) for large file handling
- Add parseAuthMode helper to normalize and validate azureAuthMode
  from JSON config (handles case-insensitivity and invalid values)
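
The normalization described for parseAuthMode might look roughly like this (illustrative only; the actual helper in the PR may handle invalid values differently):

func parseAuthMode(s string) string {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "connection_string":
		return "connection_string"
	case "workload_identity":
		return "workload_identity"
	default:
		// Empty or unrecognized values fall back to auto-detection.
		return "default"
	}
}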
@ikchifo
Contributor Author

ikchifo commented Jan 20, 2026

Ready for review. @Future-Outlier - Please take a look when you get the chance.
cc: @JiangJiaWei1103 @rueian as I saw you've recently interacted with the history server code.

Member

@Future-Outlier Future-Outlier left a comment

Thank you for this PR!
Do you mind providing screenshots or a video to prove this really works with Azure Blob Storage?

@ikchifo
Contributor Author

ikchifo commented Jan 20, 2026

Thank you for this PR! Do you mind providing screenshots or a video to prove this really works with Azure Blob Storage?

@Future-Outlier Updated the PR description with a video showing:

  • an initially empty Azure Blob container (the S3 bucket equivalent)
  • the collector sidecar logs showing that the azureblob runtime class is used at start-up
  • submitting a test job to the RayCluster
  • events being emitted in the collector sidecar after the job is submitted
  • deleting the RayCluster to trigger an upload of the history-server files
  • the files uploaded to the Blob container

The manifests and additional details used to run the test are also in the PR description.

