SparkSubmitOperator on Kubernetes: due to task failure and retry, _spark_exit_code is not 0, which eventually causes the Airflow task status to be failed #44810
Labels
area:core
area:providers
kind:bug
needs-triage
provider:apache-spark
Apache Airflow version
2.10.3
If "Other Airflow 2 version" selected, which one?
No response
What happened?
I use Airflow to schedule Spark jobs on Kubernetes with SparkSubmitOperator. The Spark job succeeded, but the Airflow task status is failed.
What you think should happen instead?
Some tasks inside the Spark job hit OOM exceptions, which produces a non-zero exit code in the driver log, so _spark_exit_code ends up not equal to 0. However, Spark retries those tasks itself and the job ultimately succeeds. Because _spark_exit_code is not 0, SparkSubmitOperator still treats the task as failed. Is it possible to skip the _spark_exit_code check and let the status code returned by the spark-submit child process (returncode) prevail instead?
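For context, here is a rough sketch (not the verbatim provider source) of the behaviour I believe causes this: on Kubernetes the hook scrapes an "exit code: N" line out of the driver log into _spark_exit_code and fails the task if it is non-zero, even when the spark-submit child process itself returned 0.

```python
# Approximate sketch of the failure-check behaviour described above; the
# class, method names and regex here are illustrative, not the exact
# apache-spark provider code.
import re


class SparkSubmitHookSketch:
    def __init__(self):
        self._is_kubernetes = True
        self._spark_exit_code = None

    def _process_spark_submit_log(self, log_lines):
        # Transient executor/task OOMs can surface a non-zero exit code in
        # the driver log even though the job later succeeds after retries.
        for line in log_lines:
            match = re.search(r"[eE]xit code: (\d+)", line)
            if match:
                self._spark_exit_code = int(match.group(1))

    def check(self, returncode):
        # The task is failed if EITHER the spark-submit child process failed
        # OR a non-zero exit code was seen in the Kubernetes driver log.
        if returncode or (self._is_kubernetes and self._spark_exit_code):
            raise RuntimeError(
                f"spark-submit failed: returncode={returncode}, "
                f"spark_exit_code={self._spark_exit_code}"
            )
```

With a check like this, a job whose tasks OOM once and are retried can still raise, which is the situation I am hitting.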
How to reproduce
Run a Spark job on Kubernetes in which some internal tasks fail with OOM and are retried by Spark. The job succeeds overall, but the Airflow task is marked as failed. A minimal DAG sketch is shown below.
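A minimal DAG sketch of the setup I am using; the connection id, container image and application path are placeholders, not the real job:

```python
# Minimal reproduction sketch; conn_id, image and application path are
# placeholders for my real Spark-on-Kubernetes job.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_on_k8s_repro",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Use a Spark application whose tasks occasionally fail with OOM and are
    # retried by Spark, so the job succeeds overall while the driver log
    # still contains a non-zero "exit code" line.
    submit = SparkSubmitOperator(
        task_id="submit_job",
        conn_id="spark_k8s",  # Spark connection pointing at k8s://https://<api-server>
        application="local:///opt/spark/app/job_with_transient_oom.py",
        conf={
            "spark.kubernetes.container.image": "my-registry/spark-app:latest",
            "spark.executor.memory": "1g",  # small enough that some tasks OOM and retry
        },
    )
```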
Operating System
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Versions of Apache Airflow Providers
No response
Deployment
Other Docker-based deployment
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
Code of Conduct