The bug
The changes from PR #576 introduce a "shortcut" in the behavior of the sparkapplication FSM and cause the termination time to not be reported in the sparkapplication resource. To be more precise, `status.terminationTime` is nil when the sparkapplication is finished.
Context
- Spark operator 1.0.1 & 1.1.1, probably 1.1.0 too
- Kubernetes:
  ```
  Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.27", GitCommit:"145f9e21a4515947d6fb10819e5a336aff1b6959", GitTreeState:"clean", BuildDate:"2020-02-21T18:01:40Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
  ```
The reason
Here is the normal FSM flow for a sparkapplication without sidecars, and for a sparkapplication with sidecars before PR #576:
```
Events:
  Type    Reason                     Age   From            Message
  ----    ------                     ----  ----            -------
  Normal  SparkApplicationAdded      35s   spark-operator  SparkApplication pyspark-pi was added, enqueuing it for submission
  Normal  SparkApplicationSubmitted  32s   spark-operator  SparkApplication pyspark-pi was submitted successfully
  Normal  SparkDriverRunning         30s   spark-operator  Driver pyspark-pi-driver is running
  Normal  SparkExecutorPending       23s   spark-operator  Executor pyspark-pi-4c5b8270fd5abd06-exec-1 is pending
  Normal  SparkExecutorRunning       22s   spark-operator  Executor pyspark-pi-4c5b8270fd5abd06-exec-1 is running
  Normal  SparkDriverCompleted       14s   spark-operator  Driver pyspark-pi-driver completed
  Normal  SparkApplicationCompleted  14s   spark-operator  SparkApplication pyspark-pi completed
```
PR #576 makes the application state turn to `CompletedState` (via `SucceedingState`) once the driver container is terminated, regardless of the status of the sidecars and thus regardless of the status of the pod.
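To illustrate the shape of that shortcut, here is a minimal Go sketch of a driver-container-based check; the function name and signature are hypothetical, loosely modeled on how one would inspect the driver pod's container statuses, not the actual code from PR #576:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// driverContainerTerminated is a hypothetical name for the post-#576
// check: the driver is treated as finished as soon as its container
// exits, even though sidecar containers may keep the pod alive.
func driverContainerTerminated(pod *corev1.Pod, driverContainerName string) bool {
	for _, c := range pod.Status.ContainerStatuses {
		if c.Name == driverContainerName && c.State.Terminated != nil {
			// pod.Status.Phase can still be Running at this point,
			// because the sidecars have not exited yet.
			return true
		}
	}
	return false
}
```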
For most use cases of sidecars, the driver container finishes before the sidecars. So effectively, PR #576 makes the sparkapplication turn to `CompletedState` before the pod is terminated.
The problem is that `status.terminationTime` is only filled out when the app is in `RunningState` and the driver pod is terminated (see here). After PR #576, this transition never happens.
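A condensed, hypothetical view of that transition (the struct and function names here are stand-ins, not the operator's real types) shows why the branch is now dead: the app has already left `RunningState` by the time the pod reaches a terminal phase.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// AppStatus stands in for the operator's real status struct.
type AppStatus struct {
	TerminationTime *metav1.Time
}

// onDriverPodUpdate sketches the transition that fills out
// status.terminationTime: it only fires while the app is still in
// RunningState and the *pod* (not just the driver container) has
// reached a terminal phase. With the #576 shortcut, the app has
// already moved on by then, so TerminationTime stays nil.
func onDriverPodUpdate(appState string, pod *corev1.Pod, status *AppStatus) {
	if appState == "RUNNING" &&
		(pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed) {
		now := metav1.Now()
		status.TerminationTime = &now
	}
}
```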
To be more concrete, here is the FSM flow after PR #576 for a sparkapplication with sidecars finishing after the driver container (again, the normal case):
```
Events:
  Type    Reason                     Age                From            Message
  ----    ------                     ----               ----            -------
  Normal  SparkApplicationAdded      65s                spark-operator  SparkApplication pyspark-pi was added, enqueuing it for submission
  Normal  SparkApplicationSubmitted  62s                spark-operator  SparkApplication pyspark-pi was submitted successfully
  Normal  SparkExecutorPending       54s                spark-operator  Executor pyspark-pi-0bba6d70fd5787d4-exec-1 is pending
  Normal  SparkExecutorRunning       52s                spark-operator  Executor pyspark-pi-0bba6d70fd5787d4-exec-1 is running
  Normal  SparkDriverRunning         45s (x2 over 60s)  spark-operator  Driver pyspark-pi-driver is running
  Normal  SparkApplicationCompleted  45s                spark-operator  SparkApplication pyspark-pi completed
```
Note that the sparkapplication never records a `SparkDriverCompleted` event.
How to solve?
Option 1: revert changes from #576
We could consider that the Spark application is finished only when the driver container and all its sidecars have finished. This would mean reverting the changes of PR #576.
I would argue this is the better and simpler option: the changes of PR #576 mess with the FSM flow by adding a third state machine, the driver container state (before, only the driver pod state and the sparkapplication state were considered).
But since this PR was merged, there must have been good reasons for it.
Option 2: harmonize the end of life of spark applications
Currently, there are two ways an app can finish:
1. the driver container terminates and the app moves through `SucceedingState` to `CompletedState` (the shortcut introduced by PR #576);
2. the driver pod terminates while the app is in `RunningState`.
Termination time is only updated in case 2.
We could factor out the common code and harmonize those two cases, as in the sketch below.
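For example, every transition into a terminal state could go through a single helper so that `status.terminationTime` is recorded the same way on both paths. This is a hypothetical sketch, not existing operator code:

```go
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// AppStatus again stands in for the operator's real status struct.
type AppStatus struct {
	State           string
	TerminationTime *metav1.Time
}

// markTerminated is a hypothetical helper: both end-of-life paths
// (container-based and pod-based) would call it when entering a
// terminal state, so terminationTime can never be skipped.
func markTerminated(status *AppStatus, terminalState string) {
	status.State = terminalState // e.g. "COMPLETED" or "FAILED"
	if status.TerminationTime == nil {
		now := metav1.Now()
		status.TerminationTime = &now
	}
}
```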
I'm not sure my analysis is sound, as this is the first time I've dug into the operator's code.
Also, I've never written any Go, but I could give it a try if needed!