[Feature][history server] support endpoint /api/v0/logs/file
#4411
Signed-off-by: machichima <nary12321@gmail.com>
Manual test:
❯ cat ~/cookies.txt
# Netscape HTTP Cookie File
# https://curl.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.
localhost FALSE / FALSE 1768701371 session_name live
localhost FALSE / FALSE 1768701371 cluster_namespace default
localhost FALSE / FALSE 1768701371 cluster_name raycluster-historyserver
❯ curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs?node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30"
{"result": true, "msg": "", "data": {"result": {"internal": ["old/", "events/", "export_events/", "ray_client_server.out", "ray_client_server.err", "dashboard_TrainHead.log", "dashboard_TrainHead.out", "dashboard_TrainHead.err", "dashboard_MetricsHead.log", "dashboard_MetricsHead.out", "dashboard_MetricsHead.err", "dashboard_JobHead.log", "dashboard_JobHead.out", "dashboard_JobHead.err", "dashboard_ServeHead.log", "dashboard_ServeHead.out", "dashboard_ServeHead.err", "dashboard_DataHead.log", "dashboard_DataHead.out", "dashboard_DataHead.err", "dashboard_EventHead.log", "dashboard_EventHead.out", "dashboard_EventHead.err", "dashboard_ReportHead.log", "dashboard_ReportHead.out", "dashboard_ReportHead.err", "dashboard_NodeHead.log", "dashboard_NodeHead.out", "dashboard_NodeHead.err", "dashboard_StateHead.log", "dashboard_StateHead.out", "dashboard_StateHead.err", "debug_state.txt", "ray_process_exit.log", "log_monitor.log", "log_monitor.out", "log_monitor.err", "nsight/", "rocprof_sys/", "profiles/", "job-driver-rayjob-9p6gf.log", "runtime_env_setup-01000000.log", "jobs/"], "gcs_server": ["gcs_server.out", "gcs_server.err"], "autoscaler": ["monitor.log", "monitor.out", "monitor.err"], "dashboard": ["dashboard.log", "dashboard.out", "dashboard.err"], "raylet": ["raylet.out", "raylet.err"], "agent": ["dashboard_agent.log", "dashboard_agent.out", "dashboard_agent.err", "runtime_env_agent.out", "runtime_env_agent.err", "runtime_env_agent.log"], "driver": ["python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_525.log", "python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff_779.log"], "core_worker": ["python-core-worker-441560b06fac01ae9a1abbf311306319307ce56a12f43fcaa5c9cc90_730.log"], "worker_out": ["worker-441560b06fac01ae9a1abbf311306319307ce56a12f43fcaa5c9cc90-01000000-730.out"], "worker_err": ["worker-441560b06fac01ae9a1abbf311306319307ce56a12f43fcaa5c9cc90-01000000-730.err"]}}}
❯ curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30&filename=raylet.out&lines=10"
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.MarkJobFinished.OnReplyReceived - 1 total (0 active), Execution time: mean = 0.01ms, total = 0.01ms, Queueing time: mean = 0.01ms, max = 0.01ms, min = 0.01ms, total = 0.01ms
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), Execution time: mean = 0.14ms, total = 0.14ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] Subscriber.HandlePublishedMessage_GCS_NODE_ADDRESS_AND_LIVENESS_CHANNEL - 1 total (0 active), Execution time: mean = 0.02ms, total = 0.02ms, Queueing time: mean = 0.05ms, max = 0.05ms, min = 0.05ms, total = 0.05ms
[state-dump] ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 8.80ms, total = 8.80ms, Queueing time: mean = 0.02ms, max = 0.02ms, min = 0.02ms, total = 0.02ms
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = 0.00ms, min = 0.00ms, total = 0.00ms
[state-dump] ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 0.24ms, total = 0.24ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] DebugString() time ms: 1
[state-dump]
[state-dump]
❯ curl "http://localhost:8080/clusters"
[
{
"name": "raycluster-historyserver",
"namespace": "default",
"sessionName": "live",
"createTime": "2026-01-18 01:45:46 +0000 UTC",
"createTimeStamp": 1768700746
}
]
❯ SESSION="session_2026-01-17_17-46-19_978420_1"
❯ curl -c ~/cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/$SESSION"
{
"name": "raycluster-historyserver",
"namespace": "default",
"result": "success",
"session": "session_2026-01-17_17-46-19_978420_1"
}
❯ curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs?node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30"
{"data":{"result":{"padding":["dashboard.err","dashboard.log","dashboard.out","dashboard_DataHead.err","dashboard_DataHead.log","dashboard_DataHead.out","dashboard_EventHead.err","dashboard_EventHead.log","dashboard_EventHead.out","dashboard_JobHead.err","dashboard_JobHead.log","dashboard_JobHead.out","dashboard_MetricsHead.err","dashboard_MetricsHead.log","dashboard_MetricsHead.out","dashboard_NodeHead.err","dashboard_NodeHead.log","dashboard_NodeHead.out","dashboard_ReportHead.err","dashboard_ReportHead.log","dashboard_ReportHead.out","dashboard_ServeHead.err","dashboard_ServeHead.log","dashboard_ServeHead.out","dashboard_StateHead.err","dashboard_StateHead.log","dashboard_StateHead.out","dashboard_TrainHead.err","dashboard_TrainHead.log","dashboard_TrainHead.out","dashboard_agent.err","dashboard_agent.log","dashboard_agent.out","debug_state.txt","gcs_server.err","gcs_server.out","job-driver-rayjob-9p6gf.log","log_monitor.err","log_monitor.log","log_monitor.out","monitor.err","monitor.log","monitor.out","python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_525.log","python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff_779.log","python-core-worker-441560b06fac01ae9a1abbf311306319307ce56a12f43fcaa5c9cc90_730.log","ray_client_server.err","ray_client_server.out","ray_process_exit.log","raylet.err","raylet.out","runtime_env_agent.err","runtime_env_agent.log","runtime_env_agent.out","runtime_env_setup-01000000.log","worker-441560b06fac01ae9a1abbf311306319307ce56a12f43fcaa5c9cc90-01000000-730.err","worker-441560b06fac01ae9a1abbf311306319307ce56a12f43fcaa5c9cc90-01000000-730.out","events/","export_events/","jobs/"]}}}
❯ curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30&filename=raylet.out&lines=10"
reason_message: "received SIGTERM"
[2026-01-17 17:52:04,876 I 495 495] (raylet) accessor.cc:186: Unregistering node node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30
[2026-01-17 17:52:04,878 I 495 495] (raylet) accessor.cc:194: Finished unregistering node info, status = OK node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30
[2026-01-17 17:52:04,878 W 495 510] (raylet) store.cc:365: Disconnecting client due to connection error with code 2: End of file
[2026-01-17 17:52:04,881 I 495 495] (raylet) agent_manager.cc:116: Killing agent dashboard_agent, pid 525.
[2026-01-17 17:52:04,885 I 495 526] (raylet) agent_manager.cc:83: Agent process with name dashboard_agent exited, exit code 0.
[2026-01-17 17:52:04,885 I 495 495] (raylet) agent_manager.cc:116: Killing agent runtime_env_agent, pid 527.
[2026-01-17 17:52:04,886 I 495 528] (raylet) agent_manager.cc:83: Agent process with name runtime_env_agent exited, exit code 0.
[2026-01-17 17:52:04,887 I 495 495] (raylet) stats.h:149: Stats module has shutdown.
❯ curl "http://localhost:8080/clusters"
[
{
"name": "raycluster-historyserver",
"namespace": "default",
"sessionName": "session_2026-01-17_17-46-19_978420_1",
"createTime": "2026-01-17T17:46:19Z",
"createTimeStamp": 1768671979
}
]
Doc("get logfile").Param(ws.QueryParameter("node_id", "node_id")).
	Param(ws.QueryParameter("filename", "filename")).
	Param(ws.QueryParameter("lines", "lines")).
	Param(ws.QueryParameter("format", "format")).
There is no "format" query parameter in the Ray dashboard /api/logs/file endpoint.
Should we support other query parameters (such as actor_id and task_id), or will this be handled in a follow-up?
Cursor Bugbot has reviewed your changes and found 3 potential issues.
index := 0
totalLines := 0

// Get the last N lines following Ray Dashboard API behavior with a circular buffer
The Ray dashboard /api/logs/file endpoint returns the "last N lines" when the lines parameter is set:
Cursor Bugbot has reviewed your changes and found 1 potential issue.
When trying to get a log file whose path contains ../, the request is rejected:
❯ curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30&filename=../../raylet.out&lines=0"
invalid path: ../ not allowed in the path (node_id=15755942e81843fd6a8ef2a788fb9d9b7605d16643cb03526c52ab30, filename=../../raylet.out)
@Future-Outlier PTAL, thank you!
Future-Outlier left a comment
cc @win5923 @AndySung320 @justinyeh1995 to help review
// Convert lines parameter to int
maxLines := 0
if lines != "" {
	if parsedLines, err := strconv.Atoi(lines); err == nil {
		maxLines = parsedLines
	}
}
// Convert lines parameter to int
maxLines := 0
if lines != "" {
	parsedLines, err := strconv.Atoi(lines)
	if err != nil {
		resp.WriteErrorString(http.StatusInternalServerError, fmt.Sprintf("invalid lines parameter: %s", lines))
		return
	}
	maxLines = parsedLines
}
Fixed in a092446
	resp.WriteError(400, err)
	return
}
resp.Write(content)
resp.Header().Set("Content-Type", "text/plain")
resp.Write(content)
Fixed in 9fc90c3
content, err := s._getNodeLogFile(clusterNameID+"_"+clusterNamespace, sessionName, nodeID, filename, maxLines)
if err != nil {
	logrus.Errorf("Error getting node log file: %v", err)
	resp.WriteError(400, err)
	resp.WriteError(http.StatusInternalServerError, err)
Fixed in c986529
func (s *ServerHandler) _getNodeLogFile(rayClusterNameID, sessionID, nodeID, filename string, maxLines int) ([]byte, error) {
	logPath := path.Join(sessionID, "logs", nodeID, filename)

	reader := s.reader.GetContent(rayClusterNameID, logPath)
No need to implement this in this PR, but for StorageReader I think we could introduce a GetContentStream method that returns an io.ReadCloser instead of loading the entire file into memory. This would let callers stream content directly and manage resource cleanup explicitly, which is essential for handling large log files efficiently.
Nice suggestion! Will do it in a follow-up.
// Parse query parameters
nodeID := req.QueryParameter("node_id")
filename := req.QueryParameter("filename")
lines := req.QueryParameter("lines")
Maybe we can define a struct, similar to Ray's GetLogOptions?
Since there are only three parameters here, I don't think it's needed for now. If we want to support those options in the future, we can add it then. WDYT?
Why are these changes needed?
Enable getting the logs of a specific file through /api/v0/logs/file in the history server, from either a live or a dead cluster.
Manual test: #4411 (comment)
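For reference, a client could build the request URL as below; logFileURL is a hypothetical helper and the base address is the one from the manual test. The manual test also sends the cookies set by /enter_cluster, which select the cluster and session, so a real client would need to do the same (for example with net/http/cookiejar).

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// logFileURL (hypothetical) builds a /api/v0/logs/file request URL.
// url.Values.Encode escapes values and sorts the keys by name.
func logFileURL(base, nodeID, filename string, lines int) string {
	q := url.Values{}
	q.Set("node_id", nodeID)
	q.Set("filename", filename)
	q.Set("lines", strconv.Itoa(lines))
	return base + "/api/v0/logs/file?" + q.Encode()
}

func main() {
	fmt.Println(logFileURL("http://localhost:8080", "NODE_ID", "raylet.out", 10))
}
```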
Related issue number
Closes #4387
Checks