Forward pod logs for crashed K8s runs that never connected #21134

Open
desertaxle wants to merge 5 commits into main from alexs/oss-7760-k8s-observer-forward-pod-logs-for-crashed-runs-that-never

@desertaxle (Member) commented Mar 16, 2026

Summary

  • When a Kubernetes pod crashes before the flow run establishes connectivity to the Prefect server (e.g., OOMKilled during import, bad entrypoint, missing dependencies), the observer now fetches and forwards container stdout/stderr so users can diagnose failures directly from Prefect
  • Adds observer.forward_crashed_run_logs (default True) and observer.forward_crashed_run_logs_tail_lines (default 500) settings to KubernetesObserverSettings
  • Fetches logs eagerly before the 30s reschedule-wait loop to avoid losing them to cluster GC
  • Scopes pod lookup to the failing job via job-name label, excludes Succeeded pods, sorts newest-first
  • Prioritizes the prefect-job container, falls back to first container for custom manifests, skips sidecars when primary has output
  • Tries previous container instance first (previous=True) for restartPolicy: OnFailure scenarios
  • Only forwards logs after the Crashed state proposal is accepted by the orchestrator
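
The pod-selection rules in the list above (scope to one job, drop Succeeded pods, newest first) can be sketched as a small pure function. This is a minimal illustration, not the code in this PR: `PodInfo` and `select_candidate_pods` are hypothetical names, and the pod is reduced to the few fields those rules mention.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PodInfo:
    """Hypothetical stand-in for the pod fields these rules rely on."""

    name: str
    job_name: str  # value of the pod's `job-name` label
    phase: str  # pod phase, e.g. "Failed", "Running", "Succeeded"
    created_at: datetime


def select_candidate_pods(pods: list[PodInfo], job_name: str) -> list[PodInfo]:
    """Keep the non-Succeeded pods of one job, sorted newest first."""
    scoped = [
        pod
        for pod in pods
        if pod.job_name == job_name and pod.phase != "Succeeded"
    ]
    return sorted(scoped, key=lambda pod: pod.created_at, reverse=True)


def _at(minute: int) -> datetime:
    return datetime(2026, 3, 16, 10, minute, tzinfo=timezone.utc)


pods = [
    PodInfo("run-a-1", "run-a", "Failed", _at(0)),
    PodInfo("run-a-2", "run-a", "Running", _at(5)),
    PodInfo("run-a-3", "run-a", "Succeeded", _at(10)),
    PodInfo("run-b-1", "run-b", "Failed", _at(20)),
]
print([pod.name for pod in select_candidate_pods(pods, "run-a")])
# prints ['run-a-2', 'run-a-1']
```

Running pods stay in the result because a pod with restartPolicy: OnFailure can still be Running while its container has already crashed and restarted.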

🤖 Generated with Claude Code

desertaxle and others added 4 commits March 16, 2026 10:57
When a Kubernetes pod crashes before the flow run process establishes
connectivity to the Prefect server (e.g., OOMKilled during import, bad
entrypoint, missing dependencies), no logs are sent to Prefect. The
observer now fetches and forwards container logs for these runs so users
can diagnose failures directly from Prefect.

Key behaviors:
- Fetches logs eagerly before the 30s reschedule-wait loop to avoid
  losing them to cluster GC (ttlSecondsAfterFinished)
- Scopes pod lookup to the failing job via job-name label
- Prioritizes the prefect-job container, falls back to first container
  for custom manifests, skips sidecars when primary has output
- Sorts retry pods newest-first so the final attempt survives truncation
- Excludes Succeeded pods, includes Running pods (restartPolicy: OnFailure)
- Tries previous container instance first for restarted containers
- Only forwards logs after the Crashed state proposal is accepted
- Respects PREFECT_LOGGING_TO_API_MAX_LOG_SIZE with truncation
- Controlled by observer.forward_crashed_run_logs (default True) and
  observer.forward_crashed_run_logs_tail_lines (default 500) settings

Closes OSS-7760

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
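
Two of the key behaviors in this commit message, trying the previous container instance first and truncating to a size limit, can be sketched as follows. This is a hedged illustration rather than the PR's implementation: the `fetch` callable is a hypothetical wrapper around whatever Kubernetes client call reads pod logs (the official Python client's `read_namespaced_pod_log` accepts `previous` and `tail_lines`), and the truncation marker and byte-based cut are assumptions. The source does not say whether truncation is applied per record or per batch.

```python
from typing import Callable, Optional

# Hypothetical fetcher: (container, previous, tail_lines) -> raw log text.
LogFetcher = Callable[[str, bool, int], str]


def fetch_container_logs(
    fetch: LogFetcher, container: str, tail_lines: int = 500
) -> Optional[str]:
    """Try the previous container instance first, then the current one."""
    for previous in (True, False):
        try:
            text = fetch(container, previous, tail_lines)
        except Exception:
            # Assumption: the API errors when no previous instance exists.
            continue
        if text.strip():
            return text
    return None


def truncate_to_budget(
    text: str, max_bytes: int, marker: str = "... [truncated]"
) -> str:
    """Cut text so its UTF-8 encoding fits in max_bytes, appending a marker."""
    max_bytes = max(max_bytes, 0)
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    budget = max_bytes - len(marker.encode("utf-8"))
    if budget <= 0:
        # Budget too small to hold the marker: hard-cut without it.
        return encoded[:max_bytes].decode("utf-8", errors="ignore")
    return encoded[:budget].decode("utf-8", errors="ignore") + marker


def fake_fetch(container: str, previous: bool, tail_lines: int) -> str:
    """Test double: the container never restarted, so previous=True fails."""
    if previous:
        raise RuntimeError("previous terminated container not found")
    return "ModuleNotFoundError: No module named 'pandas'\n"


print(fetch_container_logs(fake_fetch, "prefect-job"))
print(len(truncate_to_budget("a" * 100, 50).encode("utf-8")))  # prints 50
```

Because pods are sorted newest-first, cutting from the end of the combined output is what lets the final attempt survive truncation.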
…e check

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of concatenating all container output into a single large log
entry, send each line as its own flow run log record for cleaner
formatting in the Prefect UI. Container headers are emitted at INFO
level, log lines at ERROR level.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
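
The per-line format described in this commit can be sketched as a function that turns one container's output into (level, message) pairs. This is an assumption-laden illustration: `build_log_records` is a hypothetical name, the header wording is invented for the example, and dropping blank lines is a choice made here, not something the commit message states.

```python
import logging


def build_log_records(container: str, output: str) -> list[tuple[int, str]]:
    """Return (level, message) pairs: one INFO header, then one ERROR per line."""
    # Hypothetical header text; the real wording is not given in the source.
    records = [(logging.INFO, f"Logs from container {container!r}:")]
    for line in output.splitlines():
        if line.strip():  # assumption: blank lines are dropped
            records.append((logging.ERROR, line))
    return records


sample = "Traceback (most recent call last):\n\nImportError: no module named flows\n"
for level, message in build_log_records("prefect-job", sample):
    print(logging.getLevelName(level), message)
```

Emitting one record per line keeps each entry small and lets the UI render a traceback line by line instead of as one oversized message.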
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@desertaxle desertaxle marked this pull request as ready for review March 16, 2026 17:24

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 985437014e

When no container named prefect-job exists (custom job manifest), we
cannot reliably distinguish the flow container from sidecars. Instead
of guessing based on spec ordering, include logs from all containers
so the actual traceback is never suppressed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
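
The container-selection rule after this change can be sketched as below. It is a minimal sketch under stated assumptions: `containers_to_forward` is a hypothetical name, the input is a mapping from container name to already-fetched output in spec order, and skipping containers with empty output is an assumption. Including sidecars when prefect-job exists but produced nothing is inferred from "skips sidecars when primary has output" rather than stated outright.

```python
PRIMARY_CONTAINER = "prefect-job"


def containers_to_forward(logs_by_container: dict[str, str]) -> list[str]:
    """Pick which containers' logs to forward for a crashed run."""
    primary_output = logs_by_container.get(PRIMARY_CONTAINER, "")
    if primary_output.strip():
        # The primary container has output: sidecars would only add noise.
        return [PRIMARY_CONTAINER]
    # No prefect-job container (custom manifest) or it produced nothing:
    # include every container with output instead of guessing by spec order.
    return [name for name, out in logs_by_container.items() if out.strip()]


print(containers_to_forward({"prefect-job": "boom", "istio-proxy": "noise"}))
# prints ['prefect-job']
print(containers_to_forward({"main": "boom", "istio-proxy": "noise"}))
# prints ['main', 'istio-proxy']
```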