Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
15 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 48 additions & 0 deletions historyserver/docs/set_up_collector.md
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,54 @@ Password: minioadmin
Finally, you can check if the collector works as expected by deploying a Ray cluster with the collector enabled and
interacting with the minio UI.

> [!IMPORTANT]
> The collector sidecar must share the `/tmp/ray` directory with the Ray container using an `emptyDir` volume. This
> allows the collector to read session logs and events from the Ray process. Without this shared volume, the collector
> will fail with errors like `read session_latest file error`. See the example manifest for the required volume
> configuration:

```yaml
spec:
containers:
- name: ray-head
volumeMounts:
- name: ray-logs
mountPath: /tmp/ray
- name: collector
volumeMounts:
- name: ray-logs
mountPath: /tmp/ray
volumes:
- name: ray-logs
emptyDir: {}
```

> [!IMPORTANT]
> The Ray container must also include a `postStart` lifecycle hook to write the raylet node ID to a file. The collector
> reads this file to identify the node. Without this hook, the collector will fail with errors like
> `read nodeid file error`. Add the following lifecycle configuration to the Ray container:

```yaml
lifecycle:
postStart:
exec:
command:
- /bin/sh
- -c
- |
while true; do
nodeid=$(ps -ef | grep 'raylet.*--node_id' | grep -v grep | sed -n 's/.*--node_id=\([^ ]*\).*/\1/p' | head -1)
if [ -n "$nodeid" ]; then
echo "$nodeid" > /tmp/ray/raylet_node_id
break
fi
sleep 1
done
```

> Note: The example manifest at `config/raycluster.yaml` uses `grep -oP` (Perl regex) which may not be available in all
> containers. The `sed` command above is more portable.

```bash
# Apply the Ray cluster manifest.
kubectl apply -f historyserver/config/raycluster.yaml
Expand Down
9 changes: 9 additions & 0 deletions historyserver/go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,9 @@ module github.com/ray-project/kuberay/historyserver
go 1.25.0

require (
github.com/Azure/azure-sdk-for-go/sdk/azcore v1.20.0
github.com/Azure/azure-sdk-for-go/sdk/azidentity v1.13.1
github.com/Azure/azure-sdk-for-go/sdk/storage/azblob v1.6.4
github.com/alibabacloud-go/tea v1.3.11
github.com/aliyun/aliyun-oss-go-sdk v3.0.2+incompatible
github.com/aliyun/credentials-go v1.4.7
Expand Down Expand Up @@ -84,9 +87,15 @@ require (
)

require (
github.com/Azure/azure-sdk-for-go/sdk/internal v1.11.2 // indirect
github.com/AzureAD/microsoft-authentication-library-for-go v1.6.0 // indirect
github.com/golang-jwt/jwt/v5 v5.3.0 // indirect
github.com/kylelemons/godebug v1.1.0 // indirect
github.com/moby/spdystream v0.5.0 // indirect
github.com/pkg/browser v0.0.0-20240102092130-5ac0b6a4141c // indirect
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
github.com/stretchr/testify v1.11.1 // indirect
golang.org/x/crypto v0.45.0 // indirect
golang.org/x/mod v0.29.0 // indirect
golang.org/x/tools v0.38.0 // indirect
gomodules.xyz/jsonpatch/v2 v2.4.0 // indirect
Expand Down
25 changes: 25 additions & 0 deletions historyserver/go.sum

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Original file line number Diff line number Diff line change
Expand Up @@ -108,7 +108,7 @@ func (r *RayLogHandler) processSessionLatestLogs() {
// Extract the real session ID from the resolved path
sessionID := filepath.Base(sessionRealDir)
if r.EnableMeta {
metadir := path.Clean(r.RootDir + "/" + "metadir")
metadir := path.Join(r.RootDir, "metadir")
metafile := path.Clean(metadir + "/" + fmt.Sprintf("%s/%v",
utils.AppendRayClusterNameID(r.RayClusterName, r.RayClusterID),
path.Base(sessionID),
Expand Down Expand Up @@ -456,7 +456,7 @@ func (r *RayLogHandler) processSessionPrevLogs(sessionDir string) {
sessionID := parts[0]
logrus.Infof("Processing all node logs for session: %s", sessionID)
if r.EnableMeta {
metadir := path.Clean(r.RootDir + "/" + "metadir")
metadir := path.Join(r.RootDir, "metadir")
metafile := path.Clean(metadir + "/" + fmt.Sprintf("%s/%v",
utils.AppendRayClusterNameID(r.RayClusterName, r.RayClusterID),
path.Base(sessionID),
Expand Down Expand Up @@ -752,7 +752,7 @@ func (r *RayLogHandler) WatchSessionLatestLoops() {
// Handle changes to the symlink
if event.Op&(fsnotify.Create|fsnotify.Write) != 0 {
sessionID := filepath.Base(event.Name)
metadir := path.Clean(r.RootDir + "/" + "metadir")
metadir := path.Join(r.RootDir, "metadir")
metafile := path.Clean(metadir + "/" + fmt.Sprintf("%s/%v",
utils.AppendRayClusterNameID(r.RayClusterName, r.RayClusterID),
path.Base(sessionID),
Expand Down
13 changes: 8 additions & 5 deletions historyserver/pkg/collector/registry.go
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ import (
"github.com/ray-project/kuberay/historyserver/pkg/collector/types"
"github.com/ray-project/kuberay/historyserver/pkg/storage"
"github.com/ray-project/kuberay/historyserver/pkg/storage/aliyunoss/ray"
"github.com/ray-project/kuberay/historyserver/pkg/storage/azureblob"
"github.com/ray-project/kuberay/historyserver/pkg/storage/localtest"
"github.com/ray-project/kuberay/historyserver/pkg/storage/s3"
)
Expand All @@ -15,8 +16,9 @@ func GetWriterRegistry() WriterRegistry {
}

var writerRegistry = WriterRegistry{
"aliyunoss": ray.NewWriter,
"s3": s3.NewWriter,
"aliyunoss": ray.NewWriter,
"azureblob": azureblob.NewWriter,
"s3": s3.NewWriter,
}

type ReaderRegistry map[string]func(globalData *types.RayHistoryServerConfig, data map[string]interface{}) (storage.StorageReader, error)
Expand All @@ -26,7 +28,8 @@ func GetReaderRegistry() ReaderRegistry {
}

var readerRegistry = ReaderRegistry{
"aliyunoss": ray.NewReader,
"localtest": localtest.NewReader,
"s3": s3.NewReader,
"aliyunoss": ray.NewReader,
"azureblob": azureblob.NewReader,
"localtest": localtest.NewReader,
"s3": s3.NewReader,
}
51 changes: 51 additions & 0 deletions historyserver/pkg/storage/azureblob/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# Azure Blob Storage Collector

This module is the collector for Azure Blob Storage.

Azure endpoint, container, and authentication are read from environment
variables or `/var/collector-config/data`.

Content in `/var/collector-config/data` should be in JSON format, like
`{"azureContainer": "", "azureConnectionString": "", "azureAccountURL": ""}`

Set `--runtime-class-name=azureblob` to enable this module.

## Authentication

This module supports two authentication methods:

1. **Connection String**: Set `AZURE_STORAGE_CONNECTION_STRING` environment
variable. Suitable for development and testing.

2. **Workload Identity**: Set `AZURE_STORAGE_ACCOUNT_URL` environment variable
(e.g., `https://<account>.blob.core.windows.net`). For AKS with workload
identity enabled, the pod must have label `azure.workload.identity/use: "true"`
and use a ServiceAccount annotated with
`azure.workload.identity/client-id: "<client-id>"`.

If both are set, connection string takes precedence.

## Environment Variables

| Variable | Description |
|----------|-------------|
| `AZURE_STORAGE_CONNECTION_STRING` | Azure Storage connection string |
| `AZURE_STORAGE_ACCOUNT_URL` | Storage account URL for workload identity |
| `AZURE_STORAGE_CONTAINER` | Container name (default: `ray-historyserver`) |
| `AZURE_STORAGE_AUTH_MODE` | Auth mode: `connection_string`, `workload_identity`, or `default` |

## Local Development

Use [Azurite](https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azurite)
for local testing:

```bash
docker run -p 10000:10000 mcr.microsoft.com/azure-storage/azurite \
azurite-blob --blobHost 0.0.0.0 --skipApiVersionCheck
```

Connection string for Azurite:

```text
DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;
```
Loading
Loading