Termination time is not reported when using sidecars #841

Closed
@jrj-d

Description

The bug

The changes from PR #576 introduce a "shortcut" in the behavior of the sparkapplication FSM. As a result, the termination time is no longer reported in the sparkapplication resource.

To be more precise, status.terminationTime is nil when the sparkapplication is finished.

Context

  • Spark operator 1.0.1 & 1.1.1, probably 1.1.0 too
  • Kubernetes:
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.27", GitCommit:"145f9e21a4515947d6fb10819e5a336aff1b6959", GitTreeState:"clean", BuildDate:"2020-02-21T18:01:40Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}

The reason

Here is the normal FSM flow, both for a sparkapplication without sidecars and for a sparkapplication with sidecars before PR #576:

Events:
  Type    Reason                     Age   From            Message
  ----    ------                     ----  ----            -------
  Normal  SparkApplicationAdded      35s   spark-operator  SparkApplication pyspark-pi was added, enqueuing it for submission
  Normal  SparkApplicationSubmitted  32s   spark-operator  SparkApplication pyspark-pi was submitted successfully
  Normal  SparkDriverRunning         30s   spark-operator  Driver pyspark-pi-driver is running
  Normal  SparkExecutorPending       23s   spark-operator  Executor pyspark-pi-4c5b8270fd5abd06-exec-1 is pending
  Normal  SparkExecutorRunning       22s   spark-operator  Executor pyspark-pi-4c5b8270fd5abd06-exec-1 is running
  Normal  SparkDriverCompleted       14s   spark-operator  Driver pyspark-pi-driver completed
  Normal  SparkApplicationCompleted  14s   spark-operator  SparkApplication pyspark-pi completed

PR #576 makes the application state turn to CompletedState (via SucceedingState) once the driver container is terminated, regardless of the status of sidecars and thus regardless of the status of the pod.

For most use cases of sidecars, the driver container finishes before the sidecars. So effectively, PR #576 makes the sparkapplication turn to CompletedState before the pod is terminated.

The problem is that status.terminationTime is only filled in when the app is in RunningState and the driver pod is terminated (see here). With PR #576 this condition is never met: by the time the pod terminates, the app has already left RunningState.
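To illustrate the flow described above, here is a minimal sketch in Go. It is a reduced model and not the operator's actual code: the types `App` and `Observation` and the functions `syncWithShortcut` and `simulate` are hypothetical names made up for this example, and the intermediate SucceedingState is left out.

```go
package main

import (
	"fmt"
	"time"
)

// AppState is a simplified stand-in for the sparkapplication state.
type AppState string

const (
	Running   AppState = "RUNNING"
	Completed AppState = "COMPLETED"
)

// App is a hypothetical, reduced model of the sparkapplication status.
type App struct {
	State           AppState
	TerminationTime *time.Time
}

// Observation is what the controller sees for the driver on one sync.
type Observation struct {
	DriverContainerTerminated bool
	DriverPodTerminated       bool
}

// syncWithShortcut models the behavior described in this issue after PR #576.
func syncWithShortcut(app *App, obs Observation, now time.Time) {
	if app.State != Running {
		return // the app already left RunningState: nothing is updated
	}
	if obs.DriverPodTerminated {
		// The only branch in this model that records the termination time.
		app.TerminationTime = &now
		app.State = Completed
		return
	}
	if obs.DriverContainerTerminated {
		// Shortcut: the app completes while the sidecars keep the pod alive.
		app.State = Completed
	}
}

// simulate replays the common sidecar case: the driver container
// finishes first, and the pod terminates on a later sync.
func simulate() App {
	app := App{State: Running}
	start := time.Now()
	syncWithShortcut(&app, Observation{DriverContainerTerminated: true}, start)
	syncWithShortcut(&app, Observation{
		DriverContainerTerminated: true,
		DriverPodTerminated:       true,
	}, start.Add(10*time.Second))
	return app
}

func main() {
	app := simulate()
	fmt.Println(app.State, app.TerminationTime == nil)
}
```

In this model the second sync returns early because the app is no longer running, so the termination time stays nil even though the pod did terminate.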

To be more concrete, here is the FSM flow after PR #576 for a sparkapplication whose sidecars finish after the driver container (again, the normal case):

Events:
  Type    Reason                     Age                From            Message
  ----    ------                     ----               ----            -------
  Normal  SparkApplicationAdded      65s                spark-operator  SparkApplication pyspark-pi was added, enqueuing it for submission
  Normal  SparkApplicationSubmitted  62s                spark-operator  SparkApplication pyspark-pi was submitted successfully
  Normal  SparkExecutorPending       54s                spark-operator  Executor pyspark-pi-0bba6d70fd5787d4-exec-1 is pending
  Normal  SparkExecutorRunning       52s                spark-operator  Executor pyspark-pi-0bba6d70fd5787d4-exec-1 is running
  Normal  SparkDriverRunning         45s (x2 over 60s)  spark-operator  Driver pyspark-pi-driver is running
  Normal  SparkApplicationCompleted  45s                spark-operator  SparkApplication pyspark-pi completed

Note that the sparkapplication never records a SparkDriverCompleted event.

How to solve?

Option 1: revert changes from #576

We could consider that the Spark application is finished only when the driver container and all its sidecars have finished. This would mean reverting the changes of PR #576.
I would argue this is the better and simpler option. The changes of PR #576 mess with the FSM flow by adding a third state machine, the driver container state (before, only the driver pod state and the sparkapplication state were considered).

But since this PR has been merged, there must have been good reasons.

Option 2: harmonize the end of life of spark applications

Currently, there are two ways an app can finish:

  1. the driver container finishes, see here
  2. the driver pod finishes, see here

Termination time is only updated in case 2.

We could factor out the shared code so that both cases go through the same completion logic, including the termination time update.

I'm not sure my analysis is sound, as this is the first time I have dug into the operator's code.
Also, I've never written any Go, but I could give it a try if needed!
