[history server][storage] Add Azure Blob Storage support #4413
base: master
Conversation
Add Azure Blob Storage as a storage backend for the History Server, implementing the StorageReader and StorageWriter interfaces. This completes coverage for the major cloud providers:
- AWS S3 (existing)
- Aliyun OSS (existing)
- GCS (in progress by the Google team)
- Azure Blob Storage (this PR)

Features:
- Connection string authentication
- Auto-creation of the container if it does not exist
- Full implementation of the StorageReader/StorageWriter interfaces
- Unit tests using the Azurite emulator
- Documentation with setup instructions

Related issue: ray-project#4398
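For context, connection-string authentication with container auto-creation looks roughly like this with the Azure SDK for Go. This is a minimal sketch, not the PR's exact code; the function name `newClientFromConnectionString` is illustrative.

```go
package azureblob

import (
	"context"
	"fmt"

	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob"
	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/bloberror"
)

// newClientFromConnectionString builds a blob service client and makes sure the
// target container exists. Names are illustrative; the PR's actual code may differ.
func newClientFromConnectionString(ctx context.Context, connStr, container string) (*azblob.Client, error) {
	client, err := azblob.NewClientFromConnectionString(connStr, nil)
	if err != nil {
		return nil, fmt.Errorf("create azure blob client: %w", err)
	}
	// Auto-create the container; ignore the error if it already exists.
	if _, err := client.CreateContainer(ctx, container, nil); err != nil &&
		!bloberror.HasCode(err, bloberror.ContainerAlreadyExists) {
		return nil, fmt.Errorf("create container %q: %w", container, err)
	}
	return client, nil
}
```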
Add support for Microsoft Entra Workload Identity as an alternative to connection string authentication for Azure Blob Storage. This enables passwordless authentication for AKS-hosted collectors.

Changes:
- Add AuthMode configuration with connection_string, workload_identity, and default options
- Add AZURE_STORAGE_ACCOUNT_URL environment variable for token-based auth
- Add AZURE_STORAGE_AUTH_MODE for explicit auth mode selection
- Auto-detect the authentication method when not explicitly specified
- Add the azidentity dependency for DefaultAzureCredential support
- Update the README with workload identity setup and a troubleshooting guide
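A minimal sketch of the token-based path, assuming the `AZURE_STORAGE_ACCOUNT_URL` value is passed in as `accountURL`; the function name is illustrative, not the PR's.

```go
package azureblob

import (
	"fmt"

	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob"
)

// newClientWithDefaultCredential authenticates with DefaultAzureCredential, which
// picks up the workload identity token projected into an AKS pod. accountURL is
// expected to look like "https://<account>.blob.core.windows.net".
func newClientWithDefaultCredential(accountURL string) (*azblob.Client, error) {
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		return nil, fmt.Errorf("create default azure credential: %w", err)
	}
	return azblob.NewClient(accountURL, cred, nil)
}
```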
- Add the missing LogFiles channel to match the S3/AliyunOSS struct
- Simplify the README to match project style (257 -> 50 lines)
- Add JSON config documentation for /var/collector-config/data
- Consolidate duplicate config code with populateFromEnvAndJSON()
- Remove unnecessary panic recovery blocks
- Simplify the auth mode detection logic
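A rough sketch of what the consolidated env-plus-JSON population can look like. The struct and field names here are illustrative; only the environment variable names and JSON keys come from the PR description.

```go
package collector

import (
	"encoding/json"
	"os"
)

type azureConfig struct {
	Container        string
	ConnectionString string
	AccountURL       string
}

// populateFromEnvAndJSON fills the config from environment variables first and
// then falls back to the JSON file mounted at /var/collector-config/data.
// Simplified sketch of the consolidation described above.
func populateFromEnvAndJSON(path string) azureConfig {
	cfg := azureConfig{
		Container:        os.Getenv("AZURE_STORAGE_CONTAINER"),
		ConnectionString: os.Getenv("AZURE_STORAGE_CONNECTION_STRING"),
		AccountURL:       os.Getenv("AZURE_STORAGE_ACCOUNT_URL"),
	}
	raw, err := os.ReadFile(path)
	if err != nil {
		return cfg // no JSON config mounted; env values are used as-is
	}
	var fromJSON map[string]any
	if err := json.Unmarshal(raw, &fromJSON); err != nil {
		return cfg
	}
	// Safe type assertions: a malformed value simply leaves the field untouched.
	if cfg.Container == "" {
		if v, ok := fromJSON["azureContainer"].(string); ok {
			cfg.Container = v
		}
	}
	if cfg.ConnectionString == "" {
		if v, ok := fromJSON["azureConnectionString"].(string); ok {
			cfg.ConnectionString = v
		}
	}
	if cfg.AccountURL == "" {
		if v, ok := fromJSON["azureAccountURL"].(string); ok {
			cfg.AccountURL = v
		}
	}
	return cfg
}
```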
Fix two issues causing "<no name>" display in Azure Storage Explorer:
1. Fix the leading slash in metadir path construction (collector.go)
   - Changed `path.Clean(r.RootDir + "/" + "metadir")` to `path.Join(r.RootDir, "metadir")`
   - When RootDir is empty, the old code produced "/metadir" paths with an empty first segment that Azure displayed as "<no name>" (see the sketch after this list)
2. Make CreateDirectory a no-op for Azure Blob Storage (azureblob.go)
   - Azure Blob Storage doesn't require explicit directory markers
   - Virtual directories are automatically inferred from blob paths
   - The zero-byte marker blobs ending with "/" caused "<no name>" display issues in Azure Storage Explorer
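A quick illustration of the first fix, assuming an empty `RootDir` as in the default configuration:

```go
package main

import (
	"fmt"
	"path"
)

func main() {
	rootDir := "" // RootDir unset, as in the default collector configuration

	// Old construction: an empty RootDir yields a leading slash, which Azure
	// Storage Explorer renders as an empty "<no name>" path segment.
	fmt.Println(path.Clean(rootDir + "/" + "metadir")) // "/metadir"

	// New construction: path.Join drops empty segments, so no leading slash.
	fmt.Println(path.Join(rootDir, "metadir")) // "metadir"
}
```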
Enhance the documentation for setting up the collector with a Ray cluster by adding important configuration details. This includes instructions for sharing the `/tmp/ray` directory using an `emptyDir` volume and implementing a `postStart` lifecycle hook to write the raylet node ID to a file. These changes aim to prevent common errors related to session logs and node identification.
Align Azure Blob Storage unit tests with the existing S3 and AliyunOSS test patterns. The tests now focus on basic path manipulation rather than full integration tests requiring Azurite. E2E tests with Azurite will be added in a follow-up.
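As a rough illustration of that path-manipulation test style, here is a table-driven test; the `blobKey` helper is a hypothetical stand-in, not a function from this PR.

```go
package azureblob

import (
	"path"
	"testing"
)

// blobKey is a hypothetical helper standing in for the kind of path
// construction these unit tests exercise.
func blobKey(rootDir, name string) string {
	return path.Join(rootDir, name)
}

func TestBlobKey(t *testing.T) {
	cases := []struct {
		rootDir, name, want string
	}{
		{"", "metadir", "metadir"}, // empty root must not add a leading slash
		{"ray", "session_latest", "ray/session_latest"},
		{"ray/", "logs", "ray/logs"}, // trailing slash is collapsed by path.Join
	}
	for _, c := range cases {
		if got := blobKey(c.rootDir, c.name); got != c.want {
			t.Errorf("blobKey(%q, %q) = %q, want %q", c.rootDir, c.name, got, c.want)
		}
	}
}
```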
The Endpoint field was populated but never used. For Azure Blob Storage:
- Connection strings embed the endpoint (BlobEndpoint=...)
- AccountURL serves as the endpoint for token authentication

This removes the dead code to avoid user confusion.
Close resp.Body when DownloadStream fails to prevent connection leaks. This matches the cleanup pattern used in the S3 implementation.
Replace UploadBuffer with UploadStream to avoid loading entire files into memory. This matches the streaming approach used in S3/AliyunOSS implementations and prevents memory exhaustion with large log files.
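A minimal sketch of the streaming upload using the SDK's `UploadStream` call; the surrounding function name and parameters are illustrative.

```go
package azureblob

import (
	"context"
	"io"

	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob"
)

// uploadBlob streams the reader to Azure Blob Storage without buffering the
// whole file in memory, mirroring the S3/AliyunOSS streaming approach.
func uploadBlob(ctx context.Context, client *azblob.Client, container, blobName string, r io.Reader) error {
	_, err := client.UploadStream(ctx, container, blobName, r, nil)
	return err
}
```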
Create a fresh context for the retry download attempt in GetContent. The original context may have expired or have minimal time remaining after the listing operation completes.
- Add panic recovery to the ListFiles and List methods to match S3/AliyunOSS
- Use safe type assertions in config.go to prevent panics on invalid JSON
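The panic-recovery pattern looks roughly like this; the receiver type and method signature are illustrative sketches, not the PR's exact ones.

```go
package azureblob

import (
	"context"
	"fmt"

	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob"
)

type azureBlobReader struct {
	client    *azblob.Client
	container string
}

// ListFiles recovers from unexpected panics so a single bad response cannot
// crash the collector, matching the S3/AliyunOSS implementations.
func (r *azureBlobReader) ListFiles(ctx context.Context, prefix string) (files []string, err error) {
	defer func() {
		if rec := recover(); rec != nil {
			err = fmt.Errorf("panic while listing %q: %v", prefix, rec)
		}
	}()

	pager := r.client.NewListBlobsFlatPager(r.container, &azblob.ListBlobsFlatOptions{Prefix: &prefix})
	for pager.More() {
		page, perr := pager.NextPage(ctx)
		if perr != nil {
			return nil, perr
		}
		for _, item := range page.Segment.BlobItems {
			files = append(files, *item.Name)
		}
	}
	return files, nil
}
```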
- Remove the unused HttpClient field and net/http import
- Use azcore.ResponseError for error type checking instead of string matching when detecting the ContainerAlreadyExists error
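The typed error check is along these lines (a sketch; the SDK's `bloberror.HasCode` helper is an equivalent shorthand):

```go
package azureblob

import (
	"errors"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
)

// isContainerAlreadyExists reports whether err is the service's
// ContainerAlreadyExists error, using the typed azcore.ResponseError rather
// than matching on the error message string.
func isContainerAlreadyExists(err error) bool {
	var respErr *azcore.ResponseError
	return errors.As(err, &respErr) && respErr.ErrorCode == "ContainerAlreadyExists"
}
```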
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Move io.ReadAll inside the retry block to ensure the stream is fully read before the context is cancelled. The Azure SDK streams data from the network, so cancelling the context before reading can cause incomplete data or errors.
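Combined with the fresh retry context mentioned earlier, the download path looks roughly like this; the function name and the timeout constant are illustrative.

```go
package azureblob

import (
	"context"
	"io"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob"
)

const downloadTimeout = 10 * time.Minute // illustrative; see the timeout constants described below

// downloadBlob reads the whole stream before its context is cancelled. The SDK
// streams the body from the network, so the context must stay alive until
// io.ReadAll has finished.
func downloadBlob(client *azblob.Client, container, blobName string) ([]byte, error) {
	// Fresh context for the (retry) download, independent of whatever context
	// the preceding listing operation consumed.
	ctx, cancel := context.WithTimeout(context.Background(), downloadTimeout)
	defer cancel()

	resp, err := client.DownloadStream(ctx, container, blobName, nil)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// ReadAll completes before cancel() runs, so the stream is never cut short.
	return io.ReadAll(resp.Body)
}
```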
- Fix the GetContent fallback to match the full path instead of the basename, to avoid returning wrong content from nested directories with duplicate filenames. Also use a delimiter to list only direct children.
- Add configurable timeout constants: uploadTimeout (5m), listTimeout (2m), downloadTimeout (10m) for large file handling
- Add a parseAuthMode helper to normalize and validate azureAuthMode from JSON config (handles case-insensitivity and invalid values)
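A sketch of what such a `parseAuthMode` helper can look like, given the three accepted values listed in the configuration section below:

```go
package collector

import (
	"fmt"
	"strings"
)

// parseAuthMode normalizes the azureAuthMode value from JSON config or the
// AZURE_STORAGE_AUTH_MODE environment variable and rejects unknown values.
func parseAuthMode(raw string) (string, error) {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "", "default":
		return "default", nil // auto-detect based on which credentials are configured
	case "connection_string":
		return "connection_string", nil
	case "workload_identity":
		return "workload_identity", nil
	default:
		return "", fmt.Errorf("invalid azure auth mode %q: expected connection_string, workload_identity, or default", raw)
	}
}
```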
Ready for review. @Future-Outlier - Please take a look when you get the chance.
thank you for this PR!
do you mind providing screenshots or a video to prove this really works in Azure Blob Storage?
@Future-Outlier Updated the PR description with a video showing:
The manifests and additional details used to run the test are also in the PR's description.
Why are these changes needed?
Adds Azure Blob Storage support for the History Server, enabling Azure users to persist Ray cluster logs and events. Supports both connection string authentication (for dev/test) and Microsoft Entra Workload Identity (for production AKS deployments).
Configuration
Set `--runtime-class-name=azureblob` to enable this module.

Config via environment variables or `/var/collector-config/data`:
- JSON file: `{"azureContainer": "", "azureConnectionString": "", "azureAccountURL": ""}`
- `AZURE_STORAGE_CONNECTION_STRING`
- `AZURE_STORAGE_ACCOUNT_URL`
- `AZURE_STORAGE_CONTAINER` (defaults to `ray-historyserver`)
- `AZURE_STORAGE_AUTH_MODE`: `connection_string`, `workload_identity`, or `default`

Related issue number
Closes #4398
Checks
End-to-End Testing on AKS with Workload Identity
This feature was tested on Azure Kubernetes Service (AKS) using Microsoft Entra Workload Identity for secure, credential-free authentication to Azure Blob Storage.
1. Infrastructure Setup
Environment Variables
2. Setup RayCluster with Azure Storage-backed Collector Sidecar
3. Screenshots
Azure Storage Browser (with the uploaded session files from the Collector)
Logs when new events were detected
Demo Video
azure_blob_recording_compressed.mp4
Note: E2E tests using Azurite will be added in a follow-up PR to keep this one focused on the core implementation.