Description
Opened on Mar 28, 2023
Requirement
As a Jaeger operator,
I want to be able to limit the execution time of the jaeger-spark-dependencies
Job,
so that I can ensure the Job is not running forever and blocking/wasting resources.
Problem
The spark-dependencies Spark jobs (the actual Spark jobs inside the JVM) often run into OutOfMemory errors.
The actual problem is that the container does not fail (exit), even though the Spark job has already failed.
To solve this issue for good, I created jaegertracing/spark-dependencies#131 in the spark-dependencies repo. However, that repo seems to be unmaintained (?), so it would be an improvement to at least be able to limit the execution time of the Pod via Kubernetes specifications. This is currently not possible for the user, since the CronJob is managed by the Jaeger Operator.
Proposal
Set activeDeadlineSeconds on the Pod spec to limit the execution time. If the specified amount of time runs out before the Job finishes, the Pod is deleted and a new one is created.
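As a rough sketch of where the deadline would land in the operator-generated CronJob (schedule, image, and names here are illustrative, not the operator's actual output):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: jaeger-spark-dependencies
spec:
  schedule: "55 23 * * *"   # illustrative schedule
  jobTemplate:
    spec:
      template:
        spec:
          # Terminate the Pod if it runs longer than 8 hours (28800 s);
          # the Job controller then handles retries per its restart policy.
          activeDeadlineSeconds: 28800
          restartPolicy: Never
          containers:
            - name: jaeger-spark-dependencies
              image: jaegertracing/spark-dependencies
```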
Ideally this should be configurable within jaeger.spec.storage.dependencies. A high default value (8h or 1d) would also be fine, but would be a breaking change for genuinely long-running Spark jobs.
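In the Jaeger CR, such an option could look roughly like this (the activeDeadlineSeconds key under dependencies is hypothetical; it does not exist in the operator today and is only meant to show the proposed shape):

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: my-jaeger
spec:
  storage:
    dependencies:
      enabled: true
      # Hypothetical new option, passed through by the operator
      # to the generated CronJob's Pod spec.
      activeDeadlineSeconds: 28800
```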
This does not solve the problem entirely, but would at least be a mitigation.
Open questions
Is jaegertracing/spark-dependencies still maintained?
-> If yes: it would be better to fix the Job itself (jaegertracing/spark-dependencies#131).
-> If no: I could open a PR to address this, if the proposal sounds good to you.