The bug
The changes from PR #576 introduce a "shortcut" in the behavior of the sparkapplication FSM and cause the termination time to not be reported in the sparkapplication resource. To be more precise, `status.terminationTime` is nil when the sparkapplication is finished.
Context
- Spark operator 1.0.1 & 1.1.1, probably 1.1.0 too
- Kubernetes:
  ```
  Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.27", GitCommit:"145f9e21a4515947d6fb10819e5a336aff1b6959", GitTreeState:"clean", BuildDate:"2020-02-21T18:01:40Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
  ```
The reason
Here is the normal FSM flow for a sparkapplication without sidecars, and for a sparkapplication with sidecars before PR #576:
```
Events:
  Type    Reason                     Age   From            Message
  ----    ------                     ----  ----            -------
  Normal  SparkApplicationAdded      35s   spark-operator  SparkApplication pyspark-pi was added, enqueuing it for submission
  Normal  SparkApplicationSubmitted  32s   spark-operator  SparkApplication pyspark-pi was submitted successfully
  Normal  SparkDriverRunning         30s   spark-operator  Driver pyspark-pi-driver is running
  Normal  SparkExecutorPending       23s   spark-operator  Executor pyspark-pi-4c5b8270fd5abd06-exec-1 is pending
  Normal  SparkExecutorRunning       22s   spark-operator  Executor pyspark-pi-4c5b8270fd5abd06-exec-1 is running
  Normal  SparkDriverCompleted       14s   spark-operator  Driver pyspark-pi-driver completed
  Normal  SparkApplicationCompleted  14s   spark-operator  SparkApplication pyspark-pi completed
```
PR #576 makes the application state turn to `CompletedState` (via `SucceedingState`) once the driver container is terminated, regardless of the status of the sidecars and thus regardless of the status of the pod.
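To illustrate the shape of that shortcut, here is a minimal Go sketch of a driver-container-based check; the function name and signature are hypothetical, loosely modeled on how one would inspect the driver pod's container statuses, not the actual code from PR #576:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// driverContainerTerminated is a hypothetical name for the post-#576
// check: the driver is treated as finished as soon as its container
// exits, even though sidecar containers may keep the pod alive.
func driverContainerTerminated(pod *corev1.Pod, driverContainerName string) bool {
	for _, c := range pod.Status.ContainerStatuses {
		if c.Name == driverContainerName && c.State.Terminated != nil {
			// pod.Status.Phase can still be Running at this point,
			// because the sidecars have not exited yet.
			return true
		}
	}
	return false
}
```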
For most use cases of sidecars, the driver container finishes before the sidecars. So effectively, PR #576 makes the sparkapplication turn to `CompletedState` before the pod is terminated.
The problem is that `status.terminationTime` is only filled out when the app is in `RunningState` and the driver pod is terminated (see here). After PR #576, this transition never happens.
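A condensed, hypothetical view of that transition (the struct and function names here are stand-ins, not the operator's real types) shows why the branch is now dead: the app has already left `RunningState` by the time the pod reaches a terminal phase.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// AppStatus stands in for the operator's real status struct.
type AppStatus struct {
	TerminationTime *metav1.Time
}

// onDriverPodUpdate sketches the transition that fills out
// status.terminationTime: it only fires while the app is still in
// RunningState and the *pod* (not just the driver container) has
// reached a terminal phase. With the #576 shortcut, the app has
// already moved on by then, so TerminationTime stays nil.
func onDriverPodUpdate(appState string, pod *corev1.Pod, status *AppStatus) {
	if appState == "RUNNING" &&
		(pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed) {
		now := metav1.Now()
		status.TerminationTime = &now
	}
}
```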
To be more concrete, here is the FSM flow after PR #576 for a sparkapplication with sidecars finishing after the driver container (again, the normal case):
```
Events:
  Type    Reason                     Age                From            Message
  ----    ------                     ----               ----            -------
  Normal  SparkApplicationAdded      65s                spark-operator  SparkApplication pyspark-pi was added, enqueuing it for submission
  Normal  SparkApplicationSubmitted  62s                spark-operator  SparkApplication pyspark-pi was submitted successfully
  Normal  SparkExecutorPending       54s                spark-operator  Executor pyspark-pi-0bba6d70fd5787d4-exec-1 is pending
  Normal  SparkExecutorRunning       52s                spark-operator  Executor pyspark-pi-0bba6d70fd5787d4-exec-1 is running
  Normal  SparkDriverRunning         45s (x2 over 60s)  spark-operator  Driver pyspark-pi-driver is running
  Normal  SparkApplicationCompleted  45s                spark-operator  SparkApplication pyspark-pi completed
```
Note that the sparkapplication never records a `SparkDriverCompleted` event.
How to solve?
Option 1: revert changes from #576
We could consider that the Spark application is finished only when the driver container and all its sidecars have finished. This would mean reverting the changes of PR #576.
I would argue this is the better and simpler option: the changes of PR #576 mess with the FSM flow by adding a third state machine, the driver container state (before, only the driver pod state and the sparkapplication state were considered).
But since this PR was merged, there must have been good reasons for it.
Option 2: harmonize the end of life of spark applications
Currently, there are two ways an app can finish:
1. the driver container terminates and the app moves through `SucceedingState` to `CompletedState` (the shortcut introduced by PR #576);
2. the driver pod terminates while the app is in `RunningState`.
Termination time is only updated in case 2.
We could factor out the common code and harmonize those two cases, as in the sketch below.
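For example, every transition into a terminal state could go through a single helper so that `status.terminationTime` is recorded the same way on both paths. This is a hypothetical sketch, not existing operator code:

```go
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// AppStatus again stands in for the operator's real status struct.
type AppStatus struct {
	State           string
	TerminationTime *metav1.Time
}

// markTerminated is a hypothetical helper: both end-of-life paths
// (container-based and pod-based) would call it when entering a
// terminal state, so terminationTime can never be skipped.
func markTerminated(status *AppStatus, terminalState string) {
	status.State = terminalState // e.g. "COMPLETED" or "FAILED"
	if status.TerminationTime == nil {
		now := metav1.Now()
		status.TerminationTime = &now
	}
}
```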
I'm not sure my analysis is sound, as this is the first time I've dug into the operator's code.
Also, I've never written any Go, but I could give it a try if needed!