No archive logs for failed pod #13237

Closed
3 of 4 tasks
scany1211 opened this issue Jun 24, 2024 · 4 comments

Labels
area/archive-logs (Archive Logs feature), problem/more information needed (Not enough information has been provide to diagnose this issue.), type/bug

Comments

scany1211 commented Jun 24, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

I enabled log archiving for Argo Workflows, but the logs of failed pods are not archived, while the logs of completed pods are.

image

Version

3.5.7

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Trigger the pipeline.
Some pods exit with a failure.
Check the logs of the failed pods: they are not archived, so they cannot be shown in the web UI.
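
For reference, a self-contained workflow along the following lines could stand in for the repro the template asks for. This is only a sketch and is not the reporter's pipeline: the file name `archive-logs-fail.yaml`, the `archive-logs-fail-` name prefix, the `fail` template, and the public `busybox:1.36` image are all assumptions chosen for illustration. It only covers the simplest failure mode (the main container exits non-zero), which may differ from how the real pods failed.

```shell
#!/bin/sh
# Write a hypothetical minimal workflow whose single step prints one line
# and then exits non-zero. All names here are placeholders, not taken from
# the original pipeline.
cat > archive-logs-fail.yaml <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: archive-logs-fail-
spec:
  entrypoint: fail
  templates:
    - name: fail
      container:
        image: busybox:1.36
        command: [sh, -c]
        args: ["echo 'about to fail'; exit 1"]
EOF

# Submit it to the managed namespace (needs the argo CLI and cluster access):
#   argo submit -n processing archive-logs-fail.yaml --watch
```

If the archived log of this step does show up in S3, the problem is more likely tied to how the real pods are terminated than to the archive configuration itself.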

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

time="2024-06-23T10:30:12Z" level=info msg="index config" indexWorkflowSemaphoreKeys=true
time="2024-06-23T10:30:12Z" level=info msg="cron config" cronSyncPeriod=10s
time="2024-06-23T10:30:12Z" level=info msg="Memoization caches will be garbage-collected if they have not been hit after" gcAfterNotHitDuration=30s
time="2024-06-23T10:30:12.258Z" level=info msg="not enabling pprof debug endpoints"
time="2024-06-23T10:30:12.276Z" level=info msg="Configuration:\nartifactRepository:\n  archiveLogs: true\n  s3:\n    bucket: xxx-pipeline-logs-datastore-xxx\n    encryptionOptions:\n      enableEncryption: true\n    endpoint: s3.amazonaws.com\n    region: eu-central-1\n    roleARN: arn:aws:iam::xxx:role/xxx-logs-s3-sa-role\ninitialDelay: 0s\nmetricsConfig: {}\nnodeEvents:\n  enabled: true\npodSpecLogStrategy: {}\nsso:\n  clientId:\n    key: \"\"\n  clientSecret:\n    key: \"\"\n  issuer: \"\"\n  redirectUrl: \"\"\n  sessionExpiry: 0s\ntelemetryConfig: {}\n"
time="2024-06-23T10:30:12.276Z" level=info msg="Persistence configuration disabled"
time="2024-06-23T10:30:12.276Z" level=info executorImage="quay.io/argoproj/argoexec:v3.5.7" executorImagePullPolicy= managedNamespace=processing
time="2024-06-23T10:30:12.276Z" level=info msg="Leader election is turned off. Running in single-instance mode"
time="2024-06-23T10:30:12.276Z" level=info msg="starting leading" id=single-instance
time="2024-06-23T10:30:12.277Z" level=info msg="DB migration is disabled"
time="2024-06-23T10:30:12.277Z" level=info msg="Starting Workflow Controller" defaultRequeueTime=10s version=v3.5.7
time="2024-06-23T10:30:12.277Z" level=info msg="Current Worker Numbers" podCleanup=4 workflowTtlWorkers=4 workflowWorkers=32
time="2024-06-23T10:30:12.277Z" level=info msg="Watching task results" labelSelector="!workflows.argoproj.io/controller-instanceid,workflows.argoproj.io/workflow"
time="2024-06-23T10:30:12.277Z" level=info msg=Plugins executorPlugins=false
time="2024-06-23T10:30:12.277Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
time="2024-06-23T10:30:12.314Z" level=info msg="Manager initialized successfully"
time="2024-06-23T10:30:12.662Z" level=info msg="Performing periodic GC" periodicity=5m0s
time="2024-06-23T10:30:12.662Z" level=info msg="Persistence disabled - so archived workflow GC disabled - you must restart the controller if you enable this"
time="2024-06-23T10:30:12.662Z" level=info msg="Starting workflow garbage collector controller (retentionWorkers 4)"
time="2024-06-23T10:30:12.662Z" level=info msg="Started workflow garbage collection"
time="2024-06-23T10:30:12.662Z" level=info msg="Starting CronWorkflow controller"
W0623 10:30:12.664193       1 shared_informer.go:401] The sharedIndexInformer has started, run more than once is not allowed
time="2024-06-23T10:35:12.259Z" level=info msg="Alloc=10354 TotalAlloc=39495 Sys=31589 NumGC=9 Goroutines=168"
time="2024-06-23T10:40:12.260Z" level=info msg="Alloc=10547 TotalAlloc=40203 Sys=31845 NumGC=11 Goroutines=168"
time="2024-06-23T10:45:12.259Z" level=info msg="Alloc=10403 TotalAlloc=40695 Sys=31845 NumGC=14 Goroutines=168"
time="2024-06-23T10:50:12.259Z" level=info msg="Alloc=10579 TotalAlloc=41244 Sys=31845 NumGC=16 Goroutines=168"
time="2024-06-23T10:55:12.259Z" level=info msg="Alloc=10588 TotalAlloc=41968 Sys=31845 NumGC=19 Goroutines=168"
time="2024-06-23T11:00:12.259Z" level=info msg="Alloc=10527 TotalAlloc=42413 Sys=31845 NumGC=21 Goroutines=168"
time="2024-06-23T11:05:12.259Z" level=info msg="Alloc=10599 TotalAlloc=43015 Sys=31845 NumGC=24 Goroutines=168"
time="2024-06-23T11:10:12.259Z" level=info msg="Alloc=10710 TotalAlloc=47668 Sys=31845 NumGC=26 Goroutines=168"
time="2024-06-23T11:15:12.259Z" level=info msg="Alloc=10377 TotalAlloc=50666 Sys=31845 NumGC=29 Goroutines=168"
time="2024-06-23T11:20:12.259Z" level=info msg="Alloc=10525 TotalAlloc=51186 Sys=31845 NumGC=31 Goroutines=168"
time="2024-06-23T11:25:12.259Z" level=info msg="Alloc=10463 TotalAlloc=51799 Sys=31845 NumGC=34 Goroutines=168"
time="2024-06-23T11:30:12.259Z" level=info msg="Alloc=10558 TotalAlloc=52218 Sys=31845 NumGC=36 Goroutines=168"
time="2024-06-23T11:35:12.259Z" level=info msg="Alloc=10394 TotalAlloc=55204 Sys=31845 NumGC=39 Goroutines=168"
time="2024-06-23T11:40:12.259Z" level=info msg="Alloc=10641 TotalAlloc=55874 Sys=31845 NumGC=41 Goroutines=168"
time="2024-06-23T11:45:12.259Z" level=info msg="Alloc=10465 TotalAlloc=56477 Sys=31845 NumGC=44 Goroutines=168"
time="2024-06-23T11:50:12.260Z" level=info msg="Alloc=10481 TotalAlloc=57033 Sys=31845 NumGC=46 Goroutines=168"
time="2024-06-23T11:55:12.259Z" level=info msg="Alloc=10434 TotalAlloc=60084 Sys=31845 NumGC=49 Goroutines=168"
time="2024-06-23T12:00:12.259Z" level=info msg="Alloc=10608 TotalAlloc=60659 Sys=31845 NumGC=51 Goroutines=168"
time="2024-06-23T12:05:12.260Z" level=info msg="Alloc=10551 TotalAlloc=61300 Sys=31845 NumGC=54 Goroutines=168"
time="2024-06-23T12:10:12.259Z" level=info msg="Alloc=10627 TotalAlloc=66050 Sys=32101 NumGC=56 Goroutines=168"
time="2024-06-23T12:15:12.259Z" level=info msg="Alloc=10476 TotalAlloc=69096 Sys=32101 NumGC=59 Goroutines=168"
time="2024-06-23T12:20:12.259Z" level=info msg="Alloc=10560 TotalAlloc=69593 Sys=32101 NumGC=61 Goroutines=168"
time="2024-06-23T12:25:12.260Z" level=info msg="Alloc=10411 TotalAlloc=70178 Sys=32101 NumGC=64 Goroutines=168"
time="2024-06-23T12:30:12.259Z" level=info msg="Alloc=10564 TotalAlloc=70774 Sys=32101 NumGC=66 Goroutines=168"
time="2024-06-23T12:35:12.259Z" level=info msg="Alloc=10423 TotalAlloc=73875 Sys=32101 NumGC=69 Goroutines=168"
time="2024-06-23T12:40:12.259Z" level=info msg="Alloc=10798 TotalAlloc=78602 Sys=32101 NumGC=71 Goroutines=168"
time="2024-06-23T12:45:12.260Z" level=info msg="Alloc=10374 TotalAlloc=79084 Sys=32101 NumGC=74 Goroutines=168"
time="2024-06-23T12:50:12.259Z" level=info msg="Alloc=10633 TotalAlloc=79731 Sys=32101 NumGC=76 Goroutines=168"
time="2024-06-23T12:55:12.259Z" level=info msg="Alloc=10466 TotalAlloc=82703 Sys=32101 NumGC=79 Goroutines=168"
time="2024-06-23T13:00:12.259Z" level=info msg="Alloc=10544 TotalAlloc=83258 Sys=32101 NumGC=81 Goroutines=168"
time="2024-06-23T13:05:12.260Z" level=info msg="Alloc=10475 TotalAlloc=83924 Sys=32101 NumGC=84 Goroutines=168"
time="2024-06-23T13:10:12.259Z" level=info msg="Alloc=10434 TotalAlloc=84324 Sys=32101 NumGC=86 Goroutines=168"
time="2024-06-23T13:15:12.259Z" level=info msg="Alloc=10469 TotalAlloc=87457 Sys=32101 NumGC=89 Goroutines=168"
time="2024-06-23T13:20:12.259Z" level=info msg="Alloc=10736 TotalAlloc=88152 Sys=32101 NumGC=91 Goroutines=168"
time="2024-06-23T13:25:12.259Z" level=info msg="Alloc=10374 TotalAlloc=88557 Sys=32101 NumGC=94 Goroutines=168"
time="2024-06-23T13:30:12.259Z" level=info msg="Alloc=10488 TotalAlloc=89132 Sys=32101 NumGC=96 Goroutines=168"
time="2024-06-23T13:35:12.259Z" level=info msg="Alloc=10719 TotalAlloc=96506 Sys=32101 NumGC=99 Goroutines=168"
time="2024-06-23T13:40:12.259Z" level=info msg="Alloc=10603 TotalAlloc=97042 Sys=32101 NumGC=101 Goroutines=168"
time="2024-06-23T13:45:12.259Z" level=info msg="Alloc=10519 TotalAlloc=97670 Sys=32101 NumGC=104 Goroutines=168"
time="2024-06-23T13:50:12.259Z" level=info msg="Alloc=10566 TotalAlloc=98179 Sys=32101 NumGC=106 Goroutines=168"
time="2024-06-23T13:55:12.259Z" level=info msg="Alloc=10462 TotalAlloc=101229 Sys=32101 NumGC=109 Goroutines=168"
time="2024-06-23T14:00:12.259Z" level=info msg="Alloc=10602 TotalAlloc=101728 Sys=32101 NumGC=111 Goroutines=168"
time="2024-06-23T14:05:12.259Z" level=info msg="Alloc=10404 TotalAlloc=102384 Sys=32101 NumGC=114 Goroutines=168"
time="2024-06-23T14:10:12.259Z" level=info msg="Alloc=10599 TotalAlloc=102973 Sys=32101 NumGC=116 Goroutines=168"
time="2024-06-23T14:15:12.259Z" level=info msg="Alloc=10499 TotalAlloc=106071 Sys=32101 NumGC=119 Goroutines=168"
time="2024-06-23T14:20:12.259Z" level=info msg="Alloc=10481 TotalAlloc=106622 Sys=32101 NumGC=121 Goroutines=168"
time="2024-06-23T14:25:12.259Z" level=info msg="Alloc=14649 TotalAlloc=111364 Sys=32101 NumGC=124 Goroutines=168"
time="2024-06-23T14:30:12.260Z" level=info msg="Alloc=10579 TotalAlloc=111899 Sys=32101 NumGC=126 Goroutines=168"
time="2024-06-23T14:35:12.259Z" level=info msg="Alloc=10467 TotalAlloc=114959 Sys=32101 NumGC=129 Goroutines=168"
time="2024-06-23T14:40:12.259Z" level=info msg="Alloc=10551 TotalAlloc=115448 Sys=32101 NumGC=131 Goroutines=168"
time="2024-06-23T14:45:12.259Z" level=info msg="Alloc=10460 TotalAlloc=116078 Sys=32101 NumGC=134 Goroutines=168"
time="2024-06-23T14:50:12.259Z" level=info msg="Alloc=10526 TotalAlloc=116503 Sys=32101 NumGC=136 Goroutines=168"
time="2024-06-23T14:55:12.259Z" level=info msg="Alloc=10544 TotalAlloc=119552 Sys=32101 NumGC=139 Goroutines=168"
time="2024-06-23T15:00:12.259Z" level=info msg="Alloc=10617 TotalAlloc=120130 Sys=32101 NumGC=141 Goroutines=168"
time="2024-06-23T15:05:12.259Z" level=info msg="Alloc=10388 TotalAlloc=120652 Sys=32101 NumGC=144 Goroutines=168"
time="2024-06-23T15:10:12.259Z" level=info msg="Alloc=10478 TotalAlloc=121202 Sys=32101 NumGC=146 Goroutines=168"
time="2024-06-23T15:15:12.259Z" level=info msg="Alloc=10444 TotalAlloc=128538 Sys=32101 NumGC=149 Goroutines=168"
time="2024-06-23T15:20:12.259Z" level=info msg="Alloc=10581 TotalAlloc=129076 Sys=32101 NumGC=151 Goroutines=168"
time="2024-06-23T15:25:12.259Z" level=info msg="Alloc=10412 TotalAlloc=129673 Sys=32101 NumGC=154 Goroutines=168"
time="2024-06-23T15:30:12.259Z" level=info msg="Alloc=10549 TotalAlloc=130272 Sys=32101 NumGC=156 Goroutines=168"
time="2024-06-23T15:35:12.260Z" level=info msg="Alloc=10475 TotalAlloc=133372 Sys=32101 NumGC=159 Goroutines=168"
time="2024-06-23T15:40:12.259Z" level=info msg="Alloc=10676 TotalAlloc=133967 Sys=32101 NumGC=161 Goroutines=168"
time="2024-06-23T15:45:12.259Z" level=info msg="Alloc=10531 TotalAlloc=134567 Sys=32101 NumGC=164 Goroutines=168"
time="2024-06-23T15:50:12.259Z" level=info msg="Alloc=10582 TotalAlloc=135070 Sys=32101 NumGC=166 Goroutines=168"
time="2024-06-23T15:55:12.259Z" level=info msg="Alloc=10418 TotalAlloc=138072 Sys=32101 NumGC=169 Goroutines=168"
time="2024-06-23T16:00:12.259Z" level=info msg="Alloc=10550 TotalAlloc=138598 Sys=32101 NumGC=171 Goroutines=168"
time="2024-06-23T16:05:12.259Z" level=info msg="Alloc=14776 TotalAlloc=143413 Sys=32101 NumGC=174 Goroutines=168"
time="2024-06-23T16:10:12.259Z" level=info msg="Alloc=10488 TotalAlloc=143970 Sys=32101 NumGC=176 Goroutines=168"
time="2024-06-23T16:15:12.259Z" level=info msg="Alloc=10541 TotalAlloc=147009 Sys=32101 NumGC=179 Goroutines=168"
time="2024-06-23T16:20:12.259Z" level=info msg="Alloc=10584 TotalAlloc=147591 Sys=32101 NumGC=181 Goroutines=168"
time="2024-06-23T16:25:12.259Z" level=info msg="Alloc=10444 TotalAlloc=148123 Sys=32101 NumGC=184 Goroutines=168"
time="2024-06-23T16:30:12.259Z" level=info msg="Alloc=10564 TotalAlloc=148711 Sys=32101 NumGC=186 Goroutines=168"
time="2024-06-23T16:35:12.259Z" level=info msg="Alloc=10452 TotalAlloc=151708 Sys=32101 NumGC=189 Goroutines=168"
time="2024-06-23T16:40:12.259Z" level=info msg="Alloc=10753 TotalAlloc=156505 Sys=32101 NumGC=191 Goroutines=168"
time="2024-06-23T16:45:12.259Z" level=info msg="Alloc=10457 TotalAlloc=157038 Sys=32101 NumGC=194 Goroutines=168"
time="2024-06-23T16:50:12.259Z" level=info msg="Alloc=10626 TotalAlloc=157559 Sys=32101 NumGC=196 Goroutines=168"
time="2024-06-23T16:55:12.259Z" level=info msg="Alloc=10531 TotalAlloc=160536 Sys=32101 NumGC=199 Goroutines=168"
time="2024-06-23T17:00:12.258Z" level=info msg="Alloc=10523 TotalAlloc=160999 Sys=32101 NumGC=201 Goroutines=168"
time="2024-06-23T17:05:12.259Z" level=info msg="Alloc=10518 TotalAlloc=161659 Sys=32101 NumGC=204 Goroutines=168"
time="2024-06-23T17:10:12.259Z" level=info msg="Alloc=10633 TotalAlloc=162201 Sys=32101 NumGC=206 Goroutines=168"
time="2024-06-23T17:15:12.259Z" level=info msg="Alloc=10456 TotalAlloc=169374 Sys=32101 NumGC=209 Goroutines=168"
time="2024-06-23T17:20:12.259Z" level=info msg="Alloc=10576 TotalAlloc=169949 Sys=32101 NumGC=211 Goroutines=168"
time="2024-06-23T17:25:12.259Z" level=info msg="Alloc=10612 TotalAlloc=170569 Sys=32101 NumGC=214 Goroutines=168"
time="2024-06-23T17:30:12.259Z" level=info msg="Alloc=10534 TotalAlloc=170964 Sys=32101 NumGC=216 Goroutines=168"
time="2024-06-23T17:35:12.259Z" level=info msg="Alloc=10458 TotalAlloc=174064 Sys=32101 NumGC=219 Goroutines=168"
time="2024-06-23T17:40:12.259Z" level=info msg="Alloc=10580 TotalAlloc=174549 Sys=32101 NumGC=221 Goroutines=168"
time="2024-06-23T17:45:12.259Z" level=info msg="Alloc=10462 TotalAlloc=175058 Sys=32101 NumGC=224 Goroutines=168"
time="2024-06-23T17:50:12.259Z" level=info msg="Alloc=10816 TotalAlloc=179814 Sys=32101 NumGC=226 Goroutines=168"
time="2024-06-23T17:55:12.259Z" level=info msg="Alloc=10506 TotalAlloc=182876 Sys=32101 NumGC=229 Goroutines=168"
time="2024-06-23T18:00:12.259Z" level=info msg="Alloc=10542 TotalAlloc=183428 Sys=32101 NumGC=231 Goroutines=168"
time="2024-06-23T18:05:12.259Z" level=info msg="Alloc=10544 TotalAlloc=184193 Sys=32101 NumGC=234 Goroutines=168"
time="2024-06-23T18:10:12.259Z" level=info msg="Alloc=10571 TotalAlloc=184635 Sys=32101 NumGC=236 Goroutines=168"
time="2024-06-23T18:15:12.259Z" level=info msg="Alloc=10407 TotalAlloc=187668 Sys=32101 NumGC=239 Goroutines=168"
time="2024-06-23T18:20:12.259Z" level=info msg="Alloc=10586 TotalAlloc=188268 Sys=32101 NumGC=241 Goroutines=168"
time="2024-06-23T18:25:12.259Z" level=info msg="Alloc=10448 TotalAlloc=188897 Sys=32101 NumGC=244 Goroutines=168"
time="2024-06-23T18:30:12.259Z" level=info msg="Alloc=14709 TotalAlloc=193592 Sys=32101 NumGC=246 Goroutines=168"
time="2024-06-23T18:35:12.259Z" level=info msg="Alloc=10400 TotalAlloc=196568 Sys=32101 NumGC=249 Goroutines=168"
time="2024-06-23T18:40:12.259Z" level=info msg="Alloc=10540 TotalAlloc=197120 Sys=32101 NumGC=251 Goroutines=168"
time="2024-06-23T18:45:12.259Z" level=info msg="Alloc=10467 TotalAlloc=197844 Sys=32101 NumGC=254 Goroutines=168"
time="2024-06-23T18:50:12.259Z" level=info msg="Alloc=10584 TotalAlloc=198395 Sys=32101 NumGC=256 Goroutines=168"
time="2024-06-23T18:55:12.259Z" level=info msg="Alloc=10464 TotalAlloc=201477 Sys=32101 NumGC=259 Goroutines=168"
time="2024-06-23T19:00:12.259Z" level=info msg="Alloc=10477 TotalAlloc=201972 Sys=32101 NumGC=261 Goroutines=168"
time="2024-06-23T19:05:12.259Z" level=info msg="Alloc=10400 TotalAlloc=202542 Sys=32101 NumGC=264 Goroutines=168"
time="2024-06-23T19:10:12.259Z" level=info msg="Alloc=10551 TotalAlloc=203141 Sys=32101 NumGC=266 Goroutines=168"
time="2024-06-23T19:15:12.259Z" level=info msg="Alloc=10402 TotalAlloc=206120 Sys=32101 NumGC=269 Goroutines=168"
time="2024-06-23T19:20:12.259Z" level=info msg="Alloc=14858 TotalAlloc=210985 Sys=32101 NumGC=271 Goroutines=168"
time="2024-06-23T19:25:12.259Z" level=info msg="Alloc=10481 TotalAlloc=211535 Sys=32101 NumGC=274 Goroutines=168"
time="2024-06-23T19:30:12.260Z" level=info msg="Alloc=10567 TotalAlloc=212115 Sys=32101 NumGC=276 Goroutines=168"
time="2024-06-23T19:35:12.259Z" level=info msg="Alloc=10478 TotalAlloc=215182 Sys=32101 NumGC=279 Goroutines=168"
time="2024-06-23T19:40:12.259Z" level=info msg="Alloc=10602 TotalAlloc=215789 Sys=32101 NumGC=281 Goroutines=168"
time="2024-06-23T19:45:12.259Z" level=info msg="Alloc=10463 TotalAlloc=216362 Sys=32101 NumGC=284 Goroutines=168"
time="2024-06-23T19:50:12.259Z" level=info msg="Alloc=10541 TotalAlloc=216905 Sys=32101 NumGC=286 Goroutines=168"
time="2024-06-23T19:55:12.259Z" level=info msg="Alloc=10555 TotalAlloc=220067 Sys=32101 NumGC=289 Goroutines=168"
time="2024-06-23T20:00:12.259Z" level=info msg="Alloc=10644 TotalAlloc=220561 Sys=32101 NumGC=291 Goroutines=168"
time="2024-06-23T20:05:12.259Z" level=info msg="Alloc=10485 TotalAlloc=221206 Sys=32101 NumGC=294 Goroutines=168"
time="2024-06-23T20:10:12.259Z" level=info msg="Alloc=10679 TotalAlloc=226124 Sys=32101 NumGC=296 Goroutines=168"
time="2024-06-23T20:15:12.259Z" level=info msg="Alloc=10395 TotalAlloc=229027 Sys=32101 NumGC=299 Goroutines=168"
time="2024-06-23T20:20:12.259Z" level=info msg="Alloc=10492 TotalAlloc=229650 Sys=32101 NumGC=301 Goroutines=168"
time="2024-06-23T20:25:12.259Z" level=info msg="Alloc=10414 TotalAlloc=230296 Sys=32101 NumGC=304 Goroutines=168"
time="2024-06-23T20:30:12.259Z" level=info msg="Alloc=10572 TotalAlloc=230819 Sys=32101 NumGC=306 Goroutines=168"
time="2024-06-23T20:35:12.259Z" level=info msg="Alloc=10402 TotalAlloc=233850 Sys=32101 NumGC=309 Goroutines=168"
time="2024-06-23T20:40:12.259Z" level=info msg="Alloc=10503 TotalAlloc=238630 Sys=32101 NumGC=311 Goroutines=168"
time="2024-06-23T20:45:12.259Z" level=info msg="Alloc=10450 TotalAlloc=239215 Sys=32101 NumGC=314 Goroutines=168"
time="2024-06-23T20:50:12.259Z" level=info msg="Alloc=10669 TotalAlloc=239813 Sys=32101 NumGC=316 Goroutines=168"
time="2024-06-23T20:55:12.259Z" level=info msg="Alloc=10411 TotalAlloc=242860 Sys=32101 NumGC=319 Goroutines=168"
time="2024-06-23T21:00:12.259Z" level=info msg="Alloc=10537 TotalAlloc=243334 Sys=32101 NumGC=321 Goroutines=168"
time="2024-06-23T21:05:12.259Z" level=info msg="Alloc=10481 TotalAlloc=244025 Sys=32101 NumGC=324 Goroutines=168"
time="2024-06-23T21:10:12.259Z" level=info msg="Alloc=14760 TotalAlloc=248689 Sys=32101 NumGC=326 Goroutines=168"
time="2024-06-23T21:15:12.259Z" level=info msg="Alloc=10445 TotalAlloc=251684 Sys=32101 NumGC=329 Goroutines=168"
time="2024-06-23T21:20:12.259Z" level=info msg="Alloc=10529 TotalAlloc=252185 Sys=32101 NumGC=331 Goroutines=168"
time="2024-06-23T21:25:12.259Z" level=info msg="Alloc=10442 TotalAlloc=252817 Sys=32101 NumGC=334 Goroutines=168"
time="2024-06-23T21:30:12.259Z" level=info msg="Alloc=10490 TotalAlloc=253231 Sys=32101 NumGC=336 Goroutines=168"
time="2024-06-23T21:35:12.259Z" level=info msg="Alloc=10484 TotalAlloc=256355 Sys=32101 NumGC=339 Goroutines=168"
time="2024-06-23T21:40:12.259Z" level=info msg="Alloc=10494 TotalAlloc=256834 Sys=32101 NumGC=341 Goroutines=168"
time="2024-06-23T21:45:12.259Z" level=info msg="Alloc=10466 TotalAlloc=261545 Sys=32101 NumGC=344 Goroutines=168"
time="2024-06-23T21:50:12.259Z" level=info msg="Alloc=10499 TotalAlloc=261921 Sys=32101 NumGC=346 Goroutines=168"
time="2024-06-23T21:55:12.259Z" level=info msg="Alloc=10437 TotalAlloc=265011 Sys=32101 NumGC=349 Goroutines=168"
time="2024-06-23T22:00:12.259Z" level=info msg="Alloc=10558 TotalAlloc=265616 Sys=32101 NumGC=351 Goroutines=168"
time="2024-06-23T22:05:12.259Z" level=info msg="Alloc=10502 TotalAlloc=266310 Sys=32101 NumGC=354 Goroutines=168"
time="2024-06-23T22:10:12.259Z" level=info msg="Alloc=10564 TotalAlloc=266767 Sys=32101 NumGC=356 Goroutines=168"
time="2024-06-23T22:15:12.259Z" level=info msg="Alloc=10403 TotalAlloc=269854 Sys=32101 NumGC=359 Goroutines=168"
time="2024-06-23T22:20:12.259Z" level=info msg="Alloc=10553 TotalAlloc=270521 Sys=32101 NumGC=361 Goroutines=168"
time="2024-06-23T22:25:12.259Z" level=info msg="Alloc=10677 TotalAlloc=275316 Sys=32101 NumGC=364 Goroutines=168"
time="2024-06-23T22:30:12.259Z" level=info msg="Alloc=10623 TotalAlloc=275953 Sys=32101 NumGC=366 Goroutines=168"
time="2024-06-23T22:35:12.259Z" level=info msg="Alloc=10437 TotalAlloc=278851 Sys=32101 NumGC=369 Goroutines=168"
time="2024-06-23T22:40:12.259Z" level=info msg="Alloc=10544 TotalAlloc=279433 Sys=32101 NumGC=371 Goroutines=168"
time="2024-06-23T22:45:12.259Z" level=info msg="Alloc=10450 TotalAlloc=279978 Sys=32101 NumGC=374 Goroutines=168"
time="2024-06-23T22:50:12.259Z" level=info msg="Alloc=10617 TotalAlloc=280631 Sys=32101 NumGC=376 Goroutines=168"
time="2024-06-23T22:55:12.259Z" level=info msg="Alloc=10501 TotalAlloc=283652 Sys=32101 NumGC=379 Goroutines=168"
time="2024-06-23T23:00:12.260Z" level=info msg="Alloc=10544 TotalAlloc=284135 Sys=32101 NumGC=381 Goroutines=168"
time="2024-06-23T23:05:12.260Z" level=info msg="Alloc=10405 TotalAlloc=284812 Sys=32101 NumGC=384 Goroutines=168"
time="2024-06-23T23:10:12.259Z" level=info msg="Alloc=10862 TotalAlloc=289658 Sys=32101 NumGC=386 Goroutines=168"
time="2024-06-23T23:15:12.259Z" level=info msg="Alloc=10473 TotalAlloc=292587 Sys=32101 NumGC=389 Goroutines=168"
time="2024-06-23T23:20:12.259Z" level=info msg="Alloc=10564 TotalAlloc=293147 Sys=32101 NumGC=391 Goroutines=168"
time="2024-06-23T23:25:12.259Z" level=info msg="Alloc=10419 TotalAlloc=293679 Sys=32101 NumGC=394 Goroutines=168"
time="2024-06-23T23:30:12.259Z" level=info msg="Alloc=10587 TotalAlloc=294243 Sys=32101 NumGC=396 Goroutines=168"
time="2024-06-23T23:35:12.259Z" level=info msg="Alloc=10462 TotalAlloc=297345 Sys=32101 NumGC=399 Goroutines=168"
time="2024-06-23T23:40:12.259Z" level=info msg="Alloc=10706 TotalAlloc=298039 Sys=32101 NumGC=401 Goroutines=168"
time="2024-06-23T23:45:12.259Z" level=info msg="Alloc=10422 TotalAlloc=298622 Sys=32101 NumGC=404 Goroutines=168"
time="2024-06-23T23:50:12.259Z" level=info msg="Alloc=10790 TotalAlloc=303427 Sys=32101 NumGC=406 Goroutines=168"
time="2024-06-23T23:55:12.259Z" level=info msg="Alloc=10517 TotalAlloc=306377 Sys=32101 NumGC=409 Goroutines=168"
time="2024-06-24T00:00:12.259Z" level=info msg="Alloc=10522 TotalAlloc=306846 Sys=32101 NumGC=411 Goroutines=168"
time="2024-06-24T00:05:12.260Z" level=info msg="Alloc=10494 TotalAlloc=307487 Sys=32101 NumGC=414 Goroutines=168"
time="2024-06-24T00:10:12.259Z" level=info msg="Alloc=10583 TotalAlloc=308062 Sys=32101 NumGC=416 Goroutines=168"
time="2024-06-24T00:15:12.259Z" level=info msg="Alloc=10515 TotalAlloc=311136 Sys=32101 NumGC=419 Goroutines=168"
time="2024-06-24T00:20:12.259Z" level=info msg="Alloc=10714 TotalAlloc=311774 Sys=32101 NumGC=421 Goroutines=168"
time="2024-06-24T00:25:12.259Z" level=info msg="Alloc=10453 TotalAlloc=312320 Sys=32101 NumGC=424 Goroutines=168"
time="2024-06-24T00:30:12.259Z" level=info msg="Alloc=10624 TotalAlloc=312872 Sys=32101 NumGC=426 Goroutines=168"
time="2024-06-24T00:35:12.259Z" level=info msg="Alloc=10614 TotalAlloc=320081 Sys=32101 NumGC=429 Goroutines=168"
time="2024-06-24T00:40:12.260Z" level=info msg="Alloc=10610 TotalAlloc=320725 Sys=32101 NumGC=431 Goroutines=168"
time="2024-06-24T00:45:12.259Z" level=info msg="Alloc=10471 TotalAlloc=321244 Sys=32101 NumGC=434 Goroutines=168"
time="2024-06-24T00:50:12.259Z" level=info msg="Alloc=10606 TotalAlloc=321780 Sys=32101 NumGC=436 Goroutines=168"
time="2024-06-24T00:55:12.259Z" level=info msg="Alloc=10473 TotalAlloc=324961 Sys=32101 NumGC=439 Goroutines=168"
time="2024-06-24T01:00:12.259Z" level=info msg="Alloc=10608 TotalAlloc=325504 Sys=32101 NumGC=441 Goroutines=168"
time="2024-06-24T01:05:12.259Z" level=info msg="Alloc=10597 TotalAlloc=326169 Sys=32101 NumGC=444 Goroutines=168"
time="2024-06-24T01:10:12.260Z" level=info msg="Alloc=14798 TotalAlloc=330962 Sys=32101 NumGC=446 Goroutines=168"
time="2024-06-24T01:15:12.259Z" level=info msg="Alloc=10569 TotalAlloc=334194 Sys=32101 NumGC=449 Goroutines=168"
time="2024-06-24T01:20:12.259Z" level=info msg="Alloc=10537 TotalAlloc=334701 Sys=32101 NumGC=451 Goroutines=168"
time="2024-06-24T01:25:12.259Z" level=info msg="Alloc=10448 TotalAlloc=335294 Sys=32101 NumGC=454 Goroutines=168"
time="2024-06-24T01:30:12.259Z" level=info msg="Alloc=10554 TotalAlloc=335768 Sys=32101 NumGC=456 Goroutines=168"
time="2024-06-24T01:35:12.259Z" level=info msg="Alloc=10512 TotalAlloc=338740 Sys=32101 NumGC=459 Goroutines=168"
time="2024-06-24T01:40:12.259Z" level=info msg="Alloc=10513 TotalAlloc=339214 Sys=32101 NumGC=461 Goroutines=168"
time="2024-06-24T01:45:12.259Z" level=info msg="Alloc=10467 TotalAlloc=339715 Sys=32101 NumGC=464 Goroutines=168"
time="2024-06-24T01:50:12.260Z" level=info msg="Alloc=10579 TotalAlloc=340217 Sys=32101 NumGC=466 Goroutines=168"
time="2024-06-24T01:55:12.259Z" level=info msg="Alloc=10469 TotalAlloc=343301 Sys=32101 NumGC=469 Goroutines=168"
time="2024-06-24T02:00:12.259Z" level=info msg="Alloc=10750 TotalAlloc=348166 Sys=32101 NumGC=471 Goroutines=168"
time="2024-06-24T02:05:12.259Z" level=info msg="Alloc=10504 TotalAlloc=348702 Sys=32101 NumGC=474 Goroutines=168"
time="2024-06-24T02:10:12.259Z" level=info msg="Alloc=10682 TotalAlloc=349329 Sys=32101 NumGC=476 Goroutines=168"
time="2024-06-24T02:15:12.259Z" level=info msg="Alloc=10513 TotalAlloc=352448 Sys=32101 NumGC=479 Goroutines=168"
time="2024-06-24T02:20:12.259Z" level=info msg="Alloc=10527 TotalAlloc=352947 Sys=32101 NumGC=481 Goroutines=168"
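
Most of the lines above are the controller's periodic memory statistics (`Alloc=... TotalAlloc=...`) and contain nothing about the workflow. A small filter like the one below could make such a paste easier to read. It is a sketch only: the function name `filter_controller_logs` is made up for illustration, and the pattern assumes the statistics lines always start their message with `Alloc=<number> TotalAlloc=`, as they do in this log.

```shell
#!/bin/sh
# Hypothetical helper: drop the periodic memory-statistics lines from
# workflow-controller output and keep everything else.
filter_controller_logs() {
  grep -v 'msg="Alloc=[0-9][0-9]* TotalAlloc='
}

# Intended usage (needs cluster access, so it is left as a comment):
#   kubectl logs -n argo deploy/workflow-controller | filter_controller_logs
```

Piping through `grep ${workflow}` as the template suggests narrows the output further to lines that mention the workflow in question.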

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

no such logs

scany1211 (Author) commented Jun 24, 2024

My configuration is here:

argo-workflows:
  enabled: true
  namespaceOverride: autonomy
  fullnameOverride: argo-workflows
  executor:
    image:
      # -- Registry to use for the Workflow Executors
      registry: quay.io
  controller:
    image:
      # -- Registry to use for the controller
      registry: quay.io
    serviceAccount:
      # -- Create a service account for the server
      create: true
      # -- Labels applied to created service account
      labels: {}
      # -- Annotations applied to created service account
      annotations: {}
    extraArgs:
      - --namespaced
      - --managed-namespace
      - processing
  server:
    name: server
    servicePort: 2746
    authModes:
      - server
      # this can be removed if execute engine support authenticate with token when talking to argo workflow
    image:
      registry: quay.io # -- Registry to use for the server when installing in AWS CN
    serviceAccount:
      # -- Create a service account for the server
      create: true
      # -- Labels applied to created service account
      labels: {}
      # -- Annotations applied to created service account
      annotations: {}
    ingress:
      enabled: false # we create ingress on our own in order to utilize our template
    extraArgs:
      - --namespaced
      - --managed-namespace
      - processing
  useStaticCredentials: false
  artifactRepository:
    # -- Archive the main container logs as an artifact
    archiveLogs: true
    s3:
      bucket: "xxx"
      endpoint: "s3.amazonaws.com"
      region: "xxx"
      roleARN: "xxx"
      encryptionOptions:
        enableEncryption: true
argo-workflows-config:
  enabled: true
  namespaceOverride: xxx
  ingress:
    host: "<argo workflow host>"
    apisix:
      enabled: true
      apisixPluginConfigName: xxxx
      sessionSecret: "xxx"
      sessionCookieName: "xxx"
  server:
    servicePort: xxx

agilgur5 added the area/archive-logs label Jun 24, 2024

agilgur5 (Member) commented Jun 24, 2024

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

You didn't provide a Workflow that reproduces this. Please follow the instructions in the issue template.

My configuration is here:

This looks like a Helm configuration, but you didn't say what Helm chart or what version of that Chart you're using. It looks like argo-helm at a glance, but not entirely.

argo-workflows-config:

For instance, this looks like something custom?

3.5.7

How did your Workflow fail? Without a repro we cannot determine this. Some failure modes cannot have archive logs or artifacts in general, as the failure kills the wait container, for instance.

It's possible this is partially solved by #12413, which you can enable with ARGO_POD_STATUS_CAPTURE_FINALIZER in :latest (eventually in 3.6). Otherwise it could also have the same root cause as #12993; if a Pod is hard terminated without a long enough grace period, it would be expected that some artifacts may not have been properly saved or saved at all, because the wait container was killed (see also #13066 (comment)).
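
As a rough illustration of how that variable might be set, the sketch below only prints a `kubectl set env` command instead of running it. Several things here are assumptions rather than facts from this thread: the value `true` (on the basis that the variable is a boolean flag), the `argo` namespace and `workflow-controller` Deployment name (taken from the log command in the issue template, and likely different under a Helm `fullnameOverride`), and the helper name `finalizer_env_cmd`. With a Helm-managed install, a change made this way would probably be overwritten on the next upgrade, so setting the variable through the chart's values would be the more durable route.

```shell
#!/bin/sh
# Hypothetical helper: build (but do not run) the command that would set
# ARGO_POD_STATUS_CAPTURE_FINALIZER on the workflow-controller Deployment.
# $1 = namespace, $2 = Deployment name; both are assumptions to adjust.
finalizer_env_cmd() {
  printf 'kubectl -n %s set env deployment/%s ARGO_POD_STATUS_CAPTURE_FINALIZER=true\n' "$1" "$2"
}

# Print the command for review; pipe the output to sh to actually apply it.
finalizer_env_cmd "${NAMESPACE:-argo}" "${DEPLOYMENT:-workflow-controller}"
```

Note that the variable is only described above as available in `:latest`, so it should not be expected to have any effect on a 3.5.7 controller.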

Also note that the docs do explicitly recommend against using Archive Logs and strongly recommend using proper logging tools. For example, fluentd is not reliant on your Pod's failure mode to collect logs (although it has its own failure modes) as it runs as a DaemonSet with its own agent Pods on all nodes (with elevated permissions).

agilgur5 added the problem/more information needed label Jun 24, 2024
agilgur5 changed the title from "Argo workflow cannot archive the logs for failed pod" to "No archive logs for failed pod" Jun 24, 2024

scany1211 (Author) commented Jun 25, 2024

Hi @agilgur5,
Thanks for your prompt response. Our workflow is very long, so I can only post the steps that failed. You might not be able to reproduce it, because it depends on our service.
The pod failures themselves are expected. The failed pods are not terminated immediately; I think they received a SIGKILL signal, but no archived logs are stored in our configured S3 bucket.
Also, our argo-helm chart version is 6.9.1, and our argocd-apps Helm chart version is 1.6.2.
I haven't found how to configure "enable with ARGO_POD_STATUS_CAPTURE_FINALIZER in :latest". Would you mind giving me more detail? Thanks.

Name:         becareful-pipeline-h74k5
Namespace:    processing
Labels:       workflows.argoproj.io/completed=true
             workflows.argoproj.io/creator=system-serviceaccount-xxxx-argo-workflows-server
             workflows.argoproj.io/phase=Failed
Annotations:  workflows.argoproj.io/pod-name-format: v2
API Version:  argoproj.io/v1alpha1
Kind:         Workflow
Metadata:
 Creation Timestamp:  2024-06-20T15:16:15Z
 Generate Name:       becareful-pipeline-
 Generation:          39
 Managed Fields:
   API Version:  argoproj.io/v1alpha1
   Fields Type:  FieldsV1
   fieldsV1:
     f:metadata:
       f:generateName:
       f:labels:
         .:
         f:workflows.argoproj.io/creator:
     f:spec:
   Manager:      argo
   Operation:    Update
   Time:         2024-06-21T09:29:13Z
   API Version:  argoproj.io/v1alpha1
   Fields Type:  FieldsV1
   fieldsV1:
     f:metadata:
       f:annotations:
         .:
         f:workflows.argoproj.io/pod-name-format:
       f:labels:
         f:workflows.argoproj.io/completed:
         f:workflows.argoproj.io/phase:
     f:status:
   Manager:         workflow-controller
   Operation:       Update
   Time:            2024-06-21T09:29:23Z
 Resource Version:  8553490
 UID:               e3f6a3cf-c289-481b-bfd9-39ba08400f35
Spec:
 Arguments:
 Entrypoint:  pipelineDag
 On Exit:     exit-handler
 Shutdown:    Stop
 Templates:
   Affinity:
     Node Affinity:
       Required During Scheduling Ignored During Execution:
         Node Selector Terms:
           Match Expressions:
             Key:       type
             Operator:  In
             Values:
               non-gpu-processing
   Container:
     Env:
       Name:             LICENSE_REQUEST_LICENSE_CODE
       Value:            xxxx_pipeline
       Name:             LICENSE_REQUEST_SESSION_UID
       Value:            {{workflow.name}}
       Name:             LICENSE_REQUEST_CHECKOUT
       Value:            true
       Name:             LICENSE_REQUEST_CHECKIN
       Value:            False
       Name:             LICENSE_CHECKER_AUTH_API_ADDRESS
       Value:            http://auth-api-service.apis.svc.cluster.local:xxx
       Name:             LICENSE_CHECKER_AWAIT_TIMEOUT
       Value:            64800
     Image:              xxxx.dkr.ecr.xxxx.amazonaws.com/xxxx/xxxx_internal_license_checker:1.0.1
     Image Pull Policy:  Always
     Name:               
     Resources:
       Limits:
         Cpu:     500m
         Memory:  256Mi
       Requests:
         Cpu:     500m
         Memory:  256Mi
   Inputs:
   Metadata:
     Annotations:
       cluster-autoscaler.kubernetes.io/safe-to-evict:  false
       karpenter.sh/do-not-evict:                       true
   Name:                                                pipeline-license-holder
   Outputs:
   Service Account Name:  xxxx
   Tolerations:
     Effect:  NoExecute
     Key:     type
     Value:   non-gpu-processing
   Affinity:
     Node Affinity:
       Required During Scheduling Ignored During Execution:
         Node Selector Terms:
           Match Expressions:
             Key:       type
             Operator:  In
             Values:
               non-gpu-processing
   Container:
     Env:
       Name:             LICENSE_REQUEST_LICENSE_CODE
       Value:            xxxx_pipeline
       Name:             LICENSE_REQUEST_SESSION_UID
       Value:            {{workflow.name}}
       Name:             LICENSE_REQUEST_CHECKOUT
       Value:            false
       Name:             LICENSE_REQUEST_CHECKIN
       Value:            True
       Name:             LICENSE_CHECKER_AUTH_API_ADDRESS
       Value:            http://auth-api-service.apis.svc.cluster.local:5005
       Name:             LICENSE_CHECKER_AWAIT_TIMEOUT
       Value:            64800
     Image:              xxxx.dkr.ecr.xxxx.amazonaws.com/xxxx/xxxx_internal_license_checker:1.0.1
     Image Pull Policy:  Always
     Name:               
     Resources:
       Limits:
         Cpu:     500m
         Memory:  256Mi
       Requests:
         Cpu:     500m
         Memory:  256Mi
   Inputs:
   Metadata:
     Annotations:
       cluster-autoscaler.kubernetes.io/safe-to-evict:  false
       karpenter.sh/do-not-evict:                       true
       
    Name:                                                vehicle-state-to-enu
   Outputs:
   Service Account Name:  xxxx
   Tolerations:
     Effect:  NoExecute
     Key:     type
     Value:   non-gpu-processing
   Volumes:
     Name:  s3-storage-in
     Persistent Volume Claim:
       Claim Name:  efs-claim
   Container:
     Env:
       Name:             TASK_MSG_PRODUCER_KAFKA_ADDRESS
       Value:            xxxx.xxxx.xxx.c2.kafka.xxxx.amazonaws.com:9092,xxx.xxxx.xxx.c2.kafka.xxxx.amazonaws.com:9092,xxx.xxxx.xxx.c2.kafka.xxxx.amazonaws.com:9092
       Name:             TASK_MSG_PRODUCER_KAFKA_TOPIC
       Value:            task_done
       Name:             TASK_MSG_METRIC_IMAGE
       Value:            xxxx/vehicle_gps_to_enu
       Name:             TASK_MSG_VERSION
       Value:            1.0.0
       Name:             TASK_MSG_INPUT_PATH
       Value:            tasks_api/out/58ca041d1f9bce417246a7365df4ef/2024_06_20_15_16_08
       Name:             TASK_MSG_DATALOG_HASH
       Value:            xxxxx
       Name:             TASK_MSG_OUTPUT_PATH
       Value:            tasks_api/out/ee21e160b6116617f179a554cd0511/2024_06_20_15_16_08
       Name:             TASK_MSG_EXECUTION_INFO_UID
       Value:            xxxxx
       Name:             TASK_MSG_OUTPUT_PVC_CLAIM
       Value:            efs-claim
     Image:              xxxx.dkr.ecr.xxxx.amazonaws.com/xxxx/xxxx_internal_task_msg_producer:1.2.1
     Image Pull Policy:  Always
     Name:               
     Resources:
   Inputs:
   Metadata:
     Annotations:
       cluster-autoscaler.kubernetes.io/safe-to-evict:  false
       karpenter.sh/do-not-evict:                       true
   Name:                                                vehicle-state-to-enu-completion-notification
   Outputs:
   Affinity:
     Node Affinity:
       Required During Scheduling Ignored During Execution:
         Node Selector Terms:
           Match Expressions:
             Key:       type
             Operator:  In
             Values:
               non-gpu-processing
   Container:
     Env:
       Name:             LICENSE_REQUEST_LICENSE_CODE
       Value:            xxxx_compute
       Name:             LICENSE_REQUEST_SESSION_UID
       Value:            {{workflow.name}}-vehicle-state-to-enu
       Name:             LICENSE_REQUEST_CHECKOUT
       Value:            False
       Name:             LICENSE_REQUEST_CHECKIN
       Value:            True
       Name:             LICENSE_CHECKER_AUTH_API_ADDRESS
       Value:            http://auth-api-service.apis.svc.cluster.local:5005
       Name:             LICENSE_CHECKER_AWAIT_TIMEOUT
       Value:            3600
     Image:              xxxx.dkr.ecr.xxxx.amazonaws.com/xxxx/xxxx_internal_license_checker:1.0.1
     Image Pull Policy:  Always
     Name:               
     Resources:
       Limits:
         Cpu:     500m
         Memory:  256Mi
       Requests:
         Cpu:     500m
         Memory:  256Mi
   Inputs:
   Metadata:
     Annotations:
       cluster-autoscaler.kubernetes.io/safe-to-evict:  false
       karpenter.sh/do-not-evict:                       true

@agilgur5
Member

agilgur5 commented Jun 25, 2024

not actionable, very far from an MVCE

Our workflow is very long, so I can only post the steps that failed. You might not be able to reproduce it, because it depends on our service.

I'm sorry, but there isn't much that contributors or maintainers can do to help if you can't provide a reproduction, an error message, or similar. That's why it's asked for in nearly all OSS projects. Keep in mind maintainers are also typically overworked and understaffed or underpaid -- we have limited time, so please use it wisely if you get access to it. I am also personally an unpaid volunteer/hobbyist.

Your issue could very well be a misconfiguration or an issue due to other things you run in your cluster. We have no way of knowing what the issue is given the little information you provided.
The steps you provided, which are incomplete and taken from describe output, honestly do not really help other than to tell me you have a lot of stuff going on, which could very well be related.

A minimal reproducible example is essential when asking for help. The problem needs to be isolated, independent of any other services, and in a small example; otherwise, how do you know it's a bug in Argo's code and not elsewhere?
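
As a rough illustration of the scale such a reproduction could have, here is a minimal sketch of a Workflow that fails on purpose with log archiving requested. It assumes an artifact repository (for example S3) is already configured for the namespace; the `generateName`, the template name `main`, and the `busybox` image tag are placeholders chosen for illustration and are not taken from the reporter's setup.

```yaml
# Sketch of a minimal reproduction: one step that logs a line and exits non-zero.
# Assumes an artifact repository is already configured for this namespace.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: archive-logs-fail-   # hypothetical name
spec:
  entrypoint: main
  archiveLogs: true                  # request log archiving for every step
  templates:
    - name: main
      container:
        image: busybox:1.36          # public image, so anyone can run it
        command: [sh, -c]
        args: ["echo 'about to fail'; exit 1"]
```

If the failed step's `main.log` shows up in the artifact repository with this workflow, the problem is more likely tied to how the original pods are terminated than to log archiving of failed pods in general.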

Besides, our argo-helm chart version is 6.9.1, and the argocd-apps helm chart version is 1.6.2.

That's not a valid version of argo-helm/argo-workflows, which currently sits at 0.41.11.

Again it sounds like you're using something custom, which means we cannot know what your values.yaml file results in as we would have no idea what your Chart does. This makes it doubly not reproducible.

I'm not sure how this is related to Argo CD, and I did not ask about that either.
What I did mention is that you have an argo-workflows-config Chart that I directly quoted in my previous comment and that is not a Chart that argo-helm maintains, so again, not reproducible. We have no idea what that does.

At this point the config & examples are very far from minimal, so I will close the issue until a minimal reproduction can be provided. Without that, this issue is not actionable; there's nothing we can realistically do given the limited information, and we cannot even confirm a bug, let alone root-cause and fix it.

I can answer some more of your questions below, but that's really all I can do at this point.


further questions

I haven't found the configuration for "enable with ARGO_POD_STATUS_CAPTURE_FINALIZER in :latest". Would you mind giving me more detail? Thanks.

I did provide a link to the PR, #12413, which has all possible information and links out to the original issue etc. You can also see the environment variable and its description in the latest docs.
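
For orientation only, a controller environment variable like this is typically set on the workflow-controller Deployment. The fragment below is a sketch of a strategic-merge patch (for example, applied through Kustomize), not a complete manifest. It assumes the Deployment is named `workflow-controller` in the `argo` namespace, as in the `kubectl logs` command earlier in this issue, and that `"true"` is the value that enables the finalizer; check the environment variable docs for your version before relying on it.

```yaml
# Patch fragment (not a full Deployment): adds one env var to the controller.
# Assumed names: Deployment "workflow-controller" in namespace "argo".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: ARGO_POD_STATUS_CAPTURE_FINALIZER
              value: "true"          # assumed enabling value; verify in the docs
```

With a Helm-based install, the equivalent setting would go through whatever values key the chart exposes for extra controller environment variables.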

The pod failures are expected. The errored pod is not terminated immediately; I think it received a SIGKILL signal, but no archived logs are stored in our configured S3.

Again it is hard to tell with the limited amount of information, but that does suggest a likely similar root cause as #12993, as a SIGKILL means the Pod was indeed interrupted. If there was also a SIGTERM, it would almost certainly confirm that.
Whether the SIGKILL had a long enough grace period for the wait container to store all artifacts is impossible to tell with the limited information. You can try raising your terminationGracePeriodSeconds to try to mitigate that if that is the case (although that only affects k8s signals, and not signals from other services you may or may not have).
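
One possible way to raise the grace period for all pods of a workflow is `podSpecPatch`, sketched below. This is a fragment of a Workflow `spec`, not a full manifest, and the value `120` is an arbitrary illustration rather than a recommendation; the right number depends on how long the wait container needs to upload logs and artifacts.

```yaml
# Fragment of a Workflow spec: patch every pod with a longer grace period.
# 120 seconds is an illustrative value, not a recommendation.
spec:
  podSpecPatch: |
    terminationGracePeriodSeconds: 120
```

As noted above, this only changes how long Kubernetes waits between SIGTERM and SIGKILL; it has no effect on signals sent by anything else running in the cluster.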

@agilgur5 agilgur5 closed this as not planned Jun 25, 2024