
Conversation

@ikchifo
Contributor

@ikchifo ikchifo commented Jan 18, 2026

Why are these changes needed?

Adds Azure Blob Storage support for the History Server, enabling Azure users to persist Ray cluster logs and events. Supports both connection string authentication (for dev/test) and Microsoft Entra Workload Identity (for production AKS deployments).

Configuration

Set --runtime-class-name=azureblob to enable this module.

Configuration can be provided via environment variables or via the JSON file /var/collector-config/data:

{"azureContainer": "", "azureConnectionString": "", "azureAccountURL": ""}
Variable                           Description
AZURE_STORAGE_CONNECTION_STRING    Connection string (for dev/test)
AZURE_STORAGE_ACCOUNT_URL          Account URL (for workload identity)
AZURE_STORAGE_CONTAINER            Container name (default: ray-historyserver)
AZURE_STORAGE_AUTH_MODE            Auth mode: connection_string, workload_identity, or default
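
For illustration, a minimal workload-identity setup could look like this (the account name and values are hypothetical, not taken from the PR):

# Environment-variable form (illustrative values)
export AZURE_STORAGE_ACCOUNT_URL="https://mystorageaccount.blob.core.windows.net"
export AZURE_STORAGE_CONTAINER="ray-historyserver"
export AZURE_STORAGE_AUTH_MODE="workload_identity"

# Equivalent /var/collector-config/data JSON
{"azureContainer": "ray-historyserver", "azureConnectionString": "", "azureAccountURL": "https://mystorageaccount.blob.core.windows.net"}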

Related issue number

Closes #4398

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

End-to-End Testing on AKS with Workload Identity

This feature was tested on Azure Kubernetes Service (AKS) using Microsoft Entra Workload Identity for secure, credential-free authentication to Azure Blob Storage.

1. Infrastructure Setup

Environment Variables

export RESOURCE_GROUP="rg-kuberay-wi-test"
export LOCATION="eastus"
export CLUSTER_NAME="aks-kuberay-wi-test"
export IDENTITY_NAME="id-kuberay-historyserver"
export STORAGE_ACCOUNT_NAME="stkuberayhistory$(date +%s | tail -c 6)"
export CONTAINER_NAME="ray-historyserver"
export SERVICE_ACCOUNT_NAME="ray-collector"
export SERVICE_ACCOUNT_NAMESPACE="default"
export ACR_NAME="kuberaycollector"

# AKS Cluster with Workload Identity
az group create --name "${RESOURCE_GROUP}" --location "${LOCATION}"

az aks create \
  --resource-group "${RESOURCE_GROUP}" \
  --name "${CLUSTER_NAME}" \
  --node-count 2 \
  --enable-oidc-issuer \
  --enable-workload-identity \
  --generate-ssh-keys

az aks get-credentials --resource-group "${RESOURCE_GROUP}" --name "${CLUSTER_NAME}"

# User-Assigned Managed Identity

az identity create \
  --name "${IDENTITY_NAME}" \
  --resource-group "${RESOURCE_GROUP}" \
  --location "${LOCATION}"

export IDENTITY_CLIENT_ID=$(az identity show --name "${IDENTITY_NAME}" \
  --resource-group "${RESOURCE_GROUP}" --query "clientId" -o tsv)
export IDENTITY_PRINCIPAL_ID=$(az identity show --name "${IDENTITY_NAME}" \
  --resource-group "${RESOURCE_GROUP}" --query "principalId" -o tsv)

# Storage Account with RBAC

az storage account create \
  --name "${STORAGE_ACCOUNT_NAME}" \
  --resource-group "${RESOURCE_GROUP}" \
  --location "${LOCATION}" \
  --sku Standard_LRS

export STORAGE_ACCOUNT_ID=$(az storage account show \
  --name "${STORAGE_ACCOUNT_NAME}" \
  --resource-group "${RESOURCE_GROUP}" \
  --query "id" -o tsv)

az role assignment create \
  --assignee-object-id "${IDENTITY_PRINCIPAL_ID}" \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope "${STORAGE_ACCOUNT_ID}"

# Federated Identity Credential

export AKS_OIDC_ISSUER=$(az aks show \
  --resource-group "${RESOURCE_GROUP}" \
  --name "${CLUSTER_NAME}" \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

az identity federated-credential create \
  --name "fc-ray-collector" \
  --identity-name "${IDENTITY_NAME}" \
  --resource-group "${RESOURCE_GROUP}" \
  --issuer "${AKS_OIDC_ISSUER}" \
  --subject "system:serviceaccount:${SERVICE_ACCOUNT_NAMESPACE}:${SERVICE_ACCOUNT_NAME}" \
  --audience "api://AzureADTokenExchange"

# Create ACR and attach to AKS
az acr create --resource-group "${RESOURCE_GROUP}" --name "${ACR_NAME}" --sku Basic
az aks update --resource-group "${RESOURCE_GROUP}" --name "${CLUSTER_NAME}" --attach-acr "${ACR_NAME}"

# Build collector image
cd historyserver
az acr build --registry "${ACR_NAME}" --image collector:v0.1.0 . -f Dockerfile.collector

# Install KubeRay Operator

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1

# Create ServiceAccount with Workload Identity

kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ray-collector
  namespace: default
  annotations:
    azure.workload.identity/client-id: "${IDENTITY_CLIENT_ID}"
EOF

2. Set up a RayCluster with an Azure Storage-backed Collector Sidecar

kubectl apply -f - <<EOF
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-wi-test
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
    template:
      metadata:
        labels:
          azure.workload.identity/use: "true"
      spec:
        serviceAccountName: ray-collector
        containers:
        - name: ray-head
          image: rayproject/ray:2.50.1
          env:
          - name: RAY_enable_ray_event
            value: "true"
          - name: RAY_enable_core_worker_ray_event_to_aggregator
            value: "true"
          - name: RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR
            value: "http://localhost:8084/v1/events"
          - name: RAY_DASHBOARD_AGGREGATOR_AGENT_EXPOSABLE_EVENT_TYPES
            value: "TASK_DEFINITION_EVENT,TASK_LIFECYCLE_EVENT,ACTOR_TASK_DEFINITION_EVENT,TASK_PROFILE_EVENT,DRIVER_JOB_DEFINITION_EVENT,DRIVER_JOB_LIFECYCLE_EVENT,ACTOR_DEFINITION_EVENT,ACTOR_LIFECYCLE_EVENT,NODE_DEFINITION_EVENT,NODE_LIFECYCLE_EVENT"
          resources:
            limits:
              cpu: "2"
              memory: 4Gi
            requests:
              cpu: "500m"
              memory: 1Gi
          lifecycle:
            postStart:
              exec:
                command:
                - /bin/sh
                - -c
                - |
                  while true; do
                    nodeid=\$(ps -ef | grep 'raylet.*--node_id' | grep -v grep | sed -n 's/.*--node_id=\([^ ]*\).*/\1/p' | head -1)
                    if [ -n "\$nodeid" ]; then
                      echo "\$nodeid" > /tmp/ray/raylet_node_id
                      break
                    fi
                    sleep 1
                  done
          volumeMounts:
          - name: ray-logs
            mountPath: /tmp/ray
        - name: collector
          image: kuberaycollector.azurecr.io/collector:v0.1.0
          env:
          - name: AZURE_STORAGE_ACCOUNT_URL
            value: "https://${STORAGE_ACCOUNT_NAME}.blob.core.windows.net"
          - name: AZURE_STORAGE_CONTAINER
            value: "ray-historyserver"
          command:
          - collector
          - --role=Head
          - --runtime-class-name=azureblob
          - --ray-cluster-name=raycluster-wi-test
          - --events-port=8084
          volumeMounts:
          - name: ray-logs
            mountPath: /tmp/ray
        volumes:
        - name: ray-logs
          emptyDir: {}
EOF

# Wait for pod to be ready
kubectl wait --for=condition=ready pod -l ray.io/cluster=raycluster-wi-test --timeout=120s

# Submit a Ray job to generate activity
kubectl exec -it $(kubectl get pods -l ray.io/cluster=raycluster-wi-test -o jsonpath='{.items[0].metadata.name}') \
  -c ray-head -- python -c "import ray; ray.init(); print(ray.cluster_resources())"

# Delete the cluster to trigger log upload
kubectl delete raycluster raycluster-wi-test

# Check Azure Blob Storage for uploaded files
az storage blob list \
  --account-name "${STORAGE_ACCOUNT_NAME}" \
  --container-name ray-historyserver \
  --auth-mode login \
  --output table

# Cleanup
az group delete --name "${RESOURCE_GROUP}" --yes --no-wait

3. Screenshots

Azure Storage Browser (with the uploaded session files from the Collector)

_ray_history_azure_blob_ss

Logs when new events were detected

_ray_history_new_events

Demo Video

azure_blob_recording_compressed.mp4

Note: E2E tests using Azurite will be added in a follow-up PR to keep this one focused on the core implementation.

Add Azure Blob Storage as a storage backend for the History Server,
implementing the StorageReader and StorageWriter interfaces.

This completes coverage for major cloud providers:
- AWS S3 (existing)
- Aliyun OSS (existing)
- GCS (in progress by Google team)
- Azure Blob Storage (this PR)

Features:
- Connection string authentication
- Auto-creation of container if not exists
- Full implementation of StorageReader/StorageWriter interfaces
- Unit tests using Azurite emulator
- Documentation with setup instructions

Related issue: ray-project#4398
…support

Add support for Microsoft Entra Workload Identity as an alternative to
connection string authentication for Azure Blob Storage. This enables
passwordless authentication for AKS-hosted collectors.

Changes:
- Add AuthMode configuration with connection_string, workload_identity,
  and default options
- Add AZURE_STORAGE_ACCOUNT_URL environment variable for token-based auth
- Add AZURE_STORAGE_AUTH_MODE for explicit auth mode selection
- Auto-detect authentication method when not explicitly specified
- Add azidentity dependency for DefaultAzureCredential support
- Update README with workload identity setup and troubleshooting guide
- Add missing LogFiles channel to match S3/AliyunOSS struct
- Simplify README to match project style (257 -> 50 lines)
- Add JSON config documentation for /var/collector-config/data
- Consolidate duplicate config code with populateFromEnvAndJSON()
- Remove unnecessary panic recovery blocks
- Simplify auth mode detection logic
Fix two issues causing "<no name>" display in Azure Storage Explorer:

1. Fix leading slash in metadir path construction (collector.go)
   - Changed `path.Clean(r.RootDir + "/" + "metadir")` to
     `path.Join(r.RootDir, "metadir")`
   - When RootDir is empty, the old code produced "/metadir" paths
     with an empty first segment that Azure displayed as "<no name>"

2. Make CreateDirectory a no-op for Azure Blob Storage (azureblob.go)
   - Azure Blob Storage doesn't require explicit directory markers
   - Virtual directories are automatically inferred from blob paths
   - The zero-byte marker blobs ending with "/" caused "<no name>"
     display issues in Azure Storage Explorer
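
A minimal Go sketch of the path behavior behind fix 1 (standalone and illustrative only):

package main

import (
	"fmt"
	"path"
)

func main() {
	rootDir := "" // RootDir may be empty

	// Old construction: with an empty RootDir this yields "/metadir", whose
	// empty first segment Azure Storage Explorer renders as "<no name>".
	fmt.Println(path.Clean(rootDir + "/" + "metadir")) // "/metadir"

	// Fixed construction: path.Join drops the empty segment.
	fmt.Println(path.Join(rootDir, "metadir")) // "metadir"
}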
@ikchifo ikchifo force-pushed the feature/azure-blob-storage branch 2 times, most recently from 3a15457 to c3ab6d1 Compare January 19, 2026 22:50
…uster

Enhance the documentation for setting up the collector with a Ray cluster by adding important configuration details. This includes instructions for sharing the `/tmp/ray` directory using an `emptyDir` volume and implementing a `postStart` lifecycle hook to write the raylet node ID to a file. These changes aim to prevent common errors related to session logs and node identification.
@ikchifo ikchifo force-pushed the feature/azure-blob-storage branch from c3ab6d1 to deb63c7 Compare January 19, 2026 22:52
Align Azure Blob Storage unit tests with the existing S3 and AliyunOSS
test patterns. The tests now focus on basic path manipulation rather
than full integration tests requiring Azurite.

E2E tests with Azurite will be added in a follow-up.
@ikchifo ikchifo marked this pull request as ready for review January 19, 2026 23:15

The Endpoint field was populated but never used. For Azure Blob Storage:
- Connection strings embed the endpoint (BlobEndpoint=...)
- AccountURL serves as the endpoint for token authentication

This removes the dead code to avoid user confusion.

Close resp.Body when DownloadStream fails to prevent connection leaks.
This matches the cleanup pattern used in the S3 implementation.

Replace UploadBuffer with UploadStream to avoid loading entire files
into memory. This matches the streaming approach used in S3/AliyunOSS
implementations and prevents memory exhaustion with large log files.
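
Roughly, the streaming upload pattern looks like this (a sketch assuming the azblob SDK's *azblob.Client; the variable names and the uploadTimeout constant are illustrative):

// Sketch: stream a local log file to Azure Blob Storage without
// buffering the whole file in memory.
f, err := os.Open(localPath)
if err != nil {
	return err
}
defer f.Close()

ctx, cancel := context.WithTimeout(context.Background(), uploadTimeout)
defer cancel()

// UploadStream consumes the io.Reader in chunks, unlike UploadBuffer,
// which requires the entire payload as a []byte.
_, err = client.UploadStream(ctx, containerName, blobName, f, nil)
return err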
Create a fresh context for the retry download attempt in GetContent.
The original context may have expired or have minimal time remaining
after the listing operation completes.
- Add panic recovery to ListFiles and List methods to match S3/AliyunOSS
- Use safe type assertions in config.go to prevent panics on invalid JSON
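
The safe-assertion pattern amounts to something like the following (the key name matches the JSON config above; the struct field is illustrative):

// Safe type assertion: a malformed value (e.g. a number where a string
// is expected) is simply ignored instead of causing a panic.
if v, ok := raw["azureContainer"].(string); ok {
	cfg.Container = v
}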
…types

- Remove unused HttpClient field and net/http import
- Use azcore.ResponseError for error type checking instead of string
  matching when detecting ContainerAlreadyExists error
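
For reference, the typed error check is roughly the following (a sketch assuming the azcore and bloberror packages from the Azure SDK for Go):

var respErr *azcore.ResponseError
if errors.As(err, &respErr) && respErr.ErrorCode == string(bloberror.ContainerAlreadyExists) {
	// The container already exists; treat creation as a no-op.
	return nil
}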
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Move io.ReadAll inside the retry block to ensure the stream is fully
read before the context is cancelled. The Azure SDK streams data from
the network, so cancelling the context before reading can cause
incomplete data or errors.
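
The intended ordering is roughly as follows (a sketch; the retry handling and names are illustrative):

ctx, cancel := context.WithTimeout(context.Background(), downloadTimeout)
resp, err := client.DownloadStream(ctx, containerName, blobName, nil)
if err != nil {
	cancel()
	return nil, err
}

// Read the body fully while ctx is still alive: the SDK streams the blob
// over the network, so cancelling before the read can truncate the data.
data, readErr := io.ReadAll(resp.Body)
resp.Body.Close()
cancel()
return data, readErr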
…ation

- Fix GetContent fallback to match full path instead of basename to
  avoid returning wrong content from nested directories with duplicate
  filenames. Also use delimiter to list only direct children.
- Add configurable timeout constants: uploadTimeout (5m), listTimeout
  (2m), downloadTimeout (10m) for large file handling
- Add parseAuthMode helper to normalize and validate azureAuthMode
  from JSON config (handles case-insensitivity and invalid values)
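
The normalization described for parseAuthMode might look roughly like this (illustrative only; the actual helper in the PR may handle invalid values differently):

func parseAuthMode(s string) string {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "connection_string":
		return "connection_string"
	case "workload_identity":
		return "workload_identity"
	default:
		// Empty or unrecognized values fall back to auto-detection.
		return "default"
	}
}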
@ikchifo
Contributor Author

ikchifo commented Jan 20, 2026

Ready for review. @Future-Outlier - Please take a look when you get the chance.
cc: @JiangJiaWei1103 @rueian as I saw you've recently interacted with the history server code.

Member

@Future-Outlier Future-Outlier left a comment

Thank you for this PR!
Do you mind providing screenshots or a video to prove this really works with Azure Blob Storage?

@ikchifo
Contributor Author

ikchifo commented Jan 20, 2026

Thank you for this PR! Do you mind providing screenshots or a video to prove this really works with Azure Blob Storage?

@Future-Outlier Updated the PR description with a video showing:

  • an initially empty Azure Blob container (the S3 bucket equivalent)
  • the collector sidecar logs showing that the azureblob runtime class is used at start-up
  • submitting a test job to the RayCluster
  • events being emitted in the collector sidecar after the job is submitted
  • deleting the RayCluster to trigger an upload of the history-server files
  • the files uploaded to the Blob container

The manifests and additional details used to run the test are also in the PR description.

