Collaborator

@machichima machichima commented Jan 18, 2026

Why are these changes needed?

Enable fetching a specific log file through the history server's /api/v0/logs/file endpoint, for both live and dead clusters.

Manual test: #4411 (comment)

Related issue number

Closes #4387

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: machichima <nary12321@gmail.com>
@machichima
Collaborator Author

Manual test:

  1. Follow the setup guide here, but skip the "7. Delete Ray Cluster (Trigger Log Upload)" step
  2. Get logs from the live cluster
❯ cat ~/cookies.txt
# Netscape HTTP Cookie File
# https://curl.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.

localhost       FALSE   /       FALSE   1768701371      session_name    live
localhost       FALSE   /       FALSE   1768701371      cluster_namespace       default
localhost       FALSE   /       FALSE   1768701371      cluster_name    raycluster-historyserver

~/workData/open-source/kuberay history-server-logs-file *4 ?1
❯ curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs?node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30"
{"result": true, "msg": "", "data": {"result": {"internal": ["old/", "events/", "export_events/", "ray_client_server.out", "ray_client_server.err", "dashboard_TrainHead.log", "dashboard_TrainHead.out", "dashboard_TrainHead.err", "dashboard_MetricsHead.log", "dashboard_MetricsHead.out", "dashboard_MetricsHead.err", "dashboard_JobHead.log", "dashboard_JobHead.out", "dashboard_JobHead.err", "dashboard_ServeHead.log", "dashboard_ServeHead.out", "dashboard_ServeHead.err", "dashboard_DataHead.log", "dashboard_DataHead.out", "dashboard_DataHead.err", "dashboard_EventHead.log", "dashboard_EventHead.out", "dashboard_EventHead.err", "dashboard_ReportHead.log", "dashboard_ReportHead.out", "dashboard_ReportHead.err", "dashboard_NodeHead.log", "dashboard_NodeHead.out", "dashboard_NodeHead.err", "dashboard_StateHead.log", "dashboard_StateHead.out", "dashboard_StateHead.err", "debug_state.txt", "ray_process_exit.log", "log_monitor.log", "log_monitor.out", "log_monitor.err", "nsight/", "rocprof_sys/", "profiles/", "job-driver-rayjob-9p6gf.log", "runtime_env_setup-01000000.log", "jobs/"], "gcs_server": ["gcs_server.out", "gcs_server.err"], "autoscaler": ["monitor.log", "monitor.out", "monitor.err"], "dashboard": ["dashboard.log", "dashboard.out", "dashboard.err"], "raylet": ["raylet.out", "raylet.err"], "agent": ["dashboard_agent.log", "dashboard_agent.out", "dashboard_agent.err", "runtime_env_agent.out", "runtime_env_agent.err", "runtime_env_agent.log"], "driver": ["python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_525.log", "python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff_779.log"], "core_worker": ["python-core-worker-441560b06fac01ae9a1abbf311306319307ce56a12f43fcaa5c9cc90_730.log"], "worker_out": ["worker-441560b06fac01ae9a1abbf311306319307ce56a12f43fcaa5c9cc90-01000000-730.out"], "worker_err": ["worker-441560b06fac01ae9a1abbf311306319307ce56a12f43fcaa5c9cc90-01000000-730.err"]}}}%

~/workData/open-source/kuberay history-server-logs-file *4 ?1
❯ curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30&filename=raylet.out&lines=10"
[state-dump]    ray::rpc::JobInfoGcsService.grpc_client.MarkJobFinished.OnReplyReceived - 1 total (0 active), Execution time: mean = 0.01ms, total = 0.01ms, Queueing time: mean = 0.01ms, max = 0.01ms, min = 0.01ms, total = 0.01ms
[state-dump]    ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), Execution time: mean = 0.14ms, total = 0.14ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump]    NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump]    Subscriber.HandlePublishedMessage_GCS_NODE_ADDRESS_AND_LIVENESS_CHANNEL - 1 total (0 active), Execution time: mean = 0.02ms, total = 0.02ms, Queueing time: mean = 0.05ms, max = 0.05ms, min = 0.05ms, total = 0.05ms
[state-dump]    ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 8.80ms, total = 8.80ms, Queueing time: mean = 0.02ms, max = 0.02ms, min = 0.02ms, total = 0.02ms
[state-dump]    ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = 0.00ms, min = 0.00ms, total = 0.00ms
[state-dump]    ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 0.24ms, total = 0.24ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] DebugString() time ms: 1
[state-dump]
[state-dump]

~/workData/open-source/kuberay history-server-logs-file *4 ?1
❯ curl "http://localhost:8080/clusters"

[
 {
  "name": "raycluster-historyserver",
  "namespace": "default",
  "sessionName": "live",
  "createTime": "2026-01-18 01:45:46 +0000 UTC",
  "createTimeStamp": 1768700746
 }
]%
  3. Delete the RayCluster to trigger the upload
  4. Get logs from the dead cluster
❯ SESSION="session_2026-01-17_17-46-19_978420_1"

~/workData/open-source/kuberay history-server-logs-file *4 ?1
❯ curl -c ~/cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/$SESSION"

{
 "name": "raycluster-historyserver",
 "namespace": "default",
 "result": "success",
 "session": "session_2026-01-17_17-46-19_978420_1"
}%


~/workData/open-source/kuberay history-server-logs-file *4 ?1
❯ curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs?node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30"
{"data":{"result":{"padding":["dashboard.err","dashboard.log","dashboard.out","dashboard_DataHead.err","dashboard_DataHead.log","dashboard_DataHead.out","dashboard_EventHead.err","dashboard_EventHead.log","dashboard_EventHead.out","dashboard_JobHead.err","dashboard_JobHead.log","dashboard_JobHead.out","dashboard_MetricsHead.err","dashboard_MetricsHead.log","dashboard_MetricsHead.out","dashboard_NodeHead.err","dashboard_NodeHead.log","dashboard_NodeHead.out","dashboard_ReportHead.err","dashboard_ReportHead.log","dashboard_ReportHead.out","dashboard_ServeHead.err","dashboard_ServeHead.log","dashboard_ServeHead.out","dashboard_StateHead.err","dashboard_StateHead.log","dashboard_StateHead.out","dashboard_TrainHead.err","dashboard_TrainHead.log","dashboard_TrainHead.out","dashboard_agent.err","dashboard_agent.log","dashboard_agent.out","debug_state.txt","gcs_server.err","gcs_server.out","job-driver-rayjob-9p6gf.log","log_monitor.err","log_monitor.log","log_monitor.out","monitor.err","monitor.log","monitor.out","python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_525.log","python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff_779.log","python-core-worker-441560b06fac01ae9a1abbf311306319307ce56a12f43fcaa5c9cc90_730.log","ray_client_server.err","ray_client_server.out","ray_process_exit.log","raylet.err","raylet.out","runtime_env_agent.err","runtime_env_agent.log","runtime_env_agent.out","runtime_env_setup-01000000.log","worker-441560b06fac01ae9a1abbf311306319307ce56a12f43fcaa5c9cc90-01000000-730.err","worker-441560b06fac01ae9a1abbf311306319307ce56a12f43fcaa5c9cc90-01000000-730.out","events/","export_events/","jobs/"]}}}%

~/workData/open-source/kuberay history-server-logs-file *4 ?1
❯ curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30&filename=raylet.out&lines=10"
reason_message: "received SIGTERM"

[2026-01-17 17:52:04,876 I 495 495] (raylet) accessor.cc:186: Unregistering node node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30
[2026-01-17 17:52:04,878 I 495 495] (raylet) accessor.cc:194: Finished unregistering node info, status = OK node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30
[2026-01-17 17:52:04,878 W 495 510] (raylet) store.cc:365: Disconnecting client due to connection error with code 2: End of file
[2026-01-17 17:52:04,881 I 495 495] (raylet) agent_manager.cc:116: Killing agent dashboard_agent, pid 525.
[2026-01-17 17:52:04,885 I 495 526] (raylet) agent_manager.cc:83: Agent process with name dashboard_agent exited, exit code 0.
[2026-01-17 17:52:04,885 I 495 495] (raylet) agent_manager.cc:116: Killing agent runtime_env_agent, pid 527.
[2026-01-17 17:52:04,886 I 495 528] (raylet) agent_manager.cc:83: Agent process with name runtime_env_agent exited, exit code 0.
[2026-01-17 17:52:04,887 I 495 495] (raylet) stats.h:149: Stats module has shutdown.%

~/workData/open-source/kuberay history-server-logs-file *4 ?1
❯ curl "http://localhost:8080/clusters"
[
 {
  "name": "raycluster-historyserver",
  "namespace": "default",
  "sessionName": "session_2026-01-17_17-46-19_978420_1",
  "createTime": "2026-01-17T17:46:19Z",
  "createTimeStamp": 1768671979
 }
]%

Doc("get logfile").Param(ws.QueryParameter("node_id", "node_id")).
Param(ws.QueryParameter("filename", "filename")).
Param(ws.QueryParameter("lines", "lines")).
Param(ws.QueryParameter("format", "format")).
Collaborator Author

Member

@win5923 win5923 Jan 20, 2026

Should we support other query parameters (such as actor_id and task_id), or will that be handled in a follow-up?

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

index := 0
totalLines := 0

// Get the last N lines following Ray Dashboard API behavior with circular buffer
Collaborator Author

Signed-off-by: machichima <nary12321@gmail.com>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: machichima <nary12321@gmail.com>
@machichima
Collaborator Author

When the requested log file path contains ../, we return an error immediately to prevent path traversal attacks:

❯ curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30&filename=../../raylet.out&lines=0"
invalid path: ../ not allowed in the path (node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30, filename=../../raylet.out)%
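
A check of this kind could look roughly like the sketch below. This is an illustration only, not the code merged in the PR: `buildLogPath` and `errPathTraversal` are hypothetical names, and rejecting any `..` segment in every parameter is an assumption. The `path.Join(sessionID, "logs", nodeID, filename)` layout is taken from the snippet reviewed later in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// errPathTraversal mirrors the wording of the error shown above; the real
// handler also echoes node_id and filename, which this sketch omits.
var errPathTraversal = errors.New("invalid path: ../ not allowed in the path")

// buildLogPath rejects any parameter containing a ".." segment, then joins
// the parts. The check must run before path.Join, because Join cleans the
// result and would silently resolve ".." segments.
func buildLogPath(sessionID, nodeID, filename string) (string, error) {
	for _, part := range []string{sessionID, nodeID, filename} {
		for _, seg := range strings.Split(part, "/") {
			if seg == ".." {
				return "", errPathTraversal
			}
		}
	}
	return path.Join(sessionID, "logs", nodeID, filename), nil
}

func main() {
	p, err := buildLogPath("session_x", "node1", "raylet.out")
	fmt.Println(p, err) // session_x/logs/node1/raylet.out <nil>

	_, err = buildLogPath("session_x", "node1", "../../raylet.out")
	fmt.Println(err) // invalid path: ../ not allowed in the path
}
```

Comparing whole segments rather than searching for the substring `..` keeps legitimate names such as `a..b.log` usable.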

@machichima
Collaborator Author

@Future-Outlier PTAL, thank you!

@Future-Outlier Future-Outlier self-assigned this Jan 19, 2026
Member

@Future-Outlier Future-Outlier left a comment

@win5923 win5923 self-requested a review January 20, 2026 11:58
Comment on lines 600 to 606
// Convert lines parameter to int
maxLines := 0
if lines != "" {
    if parsedLines, err := strconv.Atoi(lines); err == nil {
        maxLines = parsedLines
    }
}
Member

Suggested change
- // Convert lines parameter to int
- maxLines := 0
- if lines != "" {
-     if parsedLines, err := strconv.Atoi(lines); err == nil {
-         maxLines = parsedLines
-     }
- }
+ // Convert lines parameter to int
+ maxLines := 0
+ if lines != "" {
+     parsedLines, err := strconv.Atoi(lines)
+     if err != nil {
+         resp.WriteErrorString(http.StatusInternalServerError, fmt.Sprintf("invalid lines parameter: %s", lines))
+         return
+     }
+     maxLines = parsedLines
+ }

Collaborator Author

Fixed in a092446

resp.WriteError(400, err)
return
}
resp.Write(content)
Member

Suggested change
- resp.Write(content)
+ resp.Header().Set("Content-Type", "text/plain")
+ resp.Write(content)

https://github.com/ray-project/ray/blob/22f7f7d85cdfe3b628b3a9e9aa37cf2ae3954820/python/ray/dashboard/modules/state/state_head.py#L264

Collaborator Author

Fixed in 9fc90c3

content, err := s._getNodeLogFile(clusterNameID+"_"+clusterNamespace, sessionName, nodeID, filename, maxLines)
if err != nil {
logrus.Errorf("Error getting node log file: %v", err)
resp.WriteError(400, err)
Member

Suggested change
- resp.WriteError(400, err)
+ resp.WriteError(http.StatusInternalServerError, err)

https://github.com/ray-project/ray/blob/22f7f7d85cdfe3b628b3a9e9aa37cf2ae3954820/python/ray/dashboard/modules/state/state_head.py#L282-L284

Collaborator Author

Fixed in c986529

func (s *ServerHandler) _getNodeLogFile(rayClusterNameID, sessionID, nodeID, filename string, maxLines int) ([]byte, error) {
logPath := path.Join(sessionID, "logs", nodeID, filename)

reader := s.reader.GetContent(rayClusterNameID, logPath)
Member

@win5923 win5923 Jan 20, 2026

No need to implement this in this PR, but for StorageReader I think we can introduce a GetContentStream method that returns an io.ReadCloser instead of loading the entire file into memory. This would allow callers to stream content directly and manage resource cleanup explicitly, which is essential for handling large log files efficiently.

Collaborator Author

Nice suggestion! I'll do it in a follow-up.

Comment on lines +574 to +577
// Parse query parameters
nodeID := req.QueryParameter("node_id")
filename := req.QueryParameter("filename")
lines := req.QueryParameter("lines")
Member

Maybe we can define a struct, similar to Ray’s GetLogOptions?

Collaborator Author

@machichima machichima Jan 20, 2026

Since there are only three parameters here, I don't think it's needed for now. If we want to support more options in the future, we can add it then. WDYT?

Signed-off-by: machichima <nary12321@gmail.com>
Development

Successfully merging this pull request may close these issues.

[Feature][history server] support endpoint /api/v0/logs/file

3 participants