@mivanov1988
Contributor

Why
We recently got the following feedback from an internal client: a data job was listed as successful even though it hit the 12-hour limit and was killed. The logs do not show this either: the last log entry is just the last object that was sent for ingestion, and there is no summary of the data job.

The problem was introduced by the fix in #1586.

When a job hits the 12-hour limit, the K8s Pod is terminated and we construct a partial JobExecutionStatus, which enters the following if statement and returns Optional.empty() rather than the constructed object.

https://github.com/vmware/versatile-data-kit/blob/4763ba877f43b270fbd4770bc1533216f7c5d618/projects/control-service/projects/pipelines_control_service/src/main/java/com/vmware/taurus/service/KubernetesService.java#L1656
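To make the failure mode concrete, here is a minimal sketch of its shape. It is not the code at the linked line: `PartialStatus`, `EarlyReturnSketch`, `buildStatus`, and the `fullyPopulated` flag are hypothetical stand-ins, and the real guard condition in KubernetesService.java is not reproduced here. The only point is that a guard inside the method that builds the status can discard a partially built object, and the termination reason with it.

```java
import java.util.Optional;

// Hypothetical, simplified stand-in for the real JobExecutionStatus.
final class PartialStatus {
    final String executionId;
    final String terminationReason; // illustrative value only, e.g. "DeadlineExceeded"
    final boolean fullyPopulated;   // assumed flag: false when the Pod is already gone

    PartialStatus(String executionId, String terminationReason, boolean fullyPopulated) {
        this.executionId = executionId;
        this.terminationReason = terminationReason;
        this.fullyPopulated = fullyPopulated;
    }
}

final class EarlyReturnSketch {
    // The guard sits inside the builder, so a partial status is silently dropped.
    static Optional<PartialStatus> buildStatus(
            String executionId, String terminationReason, boolean podStillPresent) {
        PartialStatus status =
                new PartialStatus(executionId, terminationReason, podStillPresent);
        if (!status.fullyPopulated) {  // stand-in for the guard at the linked line
            return Optional.empty();   // the termination reason is lost here
        }
        return Optional.of(status);
    }
}
```

In this sketch, a caller that receives Optional.empty() cannot tell "nothing to report yet" apart from "the job was killed", which is why the execution record is never updated.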

As a result, the job execution stays stuck in the Running status until the emergency logic detects it and marks it as successful, because it has no associated Pods.
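The effect of the emergency logic can be sketched roughly as follows. This is an assumption-laden illustration, not the control service's actual synchronization code: `EmergencySyncSketch`, `reconcile`, and the plain string statuses are hypothetical names chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class EmergencySyncSketch {
    // Any execution still marked RUNNING that has no Pod left is assumed to
    // have finished and is marked SUCCEEDED. A killed job whose status update
    // was dropped ends up on exactly this path.
    static Map<String, String> reconcile(
            Map<String, String> statusByExecutionId, Set<String> executionsWithPods) {
        Map<String, String> result = new LinkedHashMap<>(statusByExecutionId);
        for (Map.Entry<String, String> entry : result.entrySet()) {
            boolean running = "RUNNING".equals(entry.getValue());
            boolean hasPod = executionsWithPods.contains(entry.getKey());
            if (running && !hasPod) {
                entry.setValue("SUCCEEDED");
            }
        }
        return result;
    }
}
```

Under these assumptions, the fallback has no information about why the Pod disappeared, so a deadline kill and a normal completion look identical to it.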

What
Added validation for an already completed job in a more appropriate place.
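One way to read "a more appropriate place" is sketched below: the code that derives the status always returns what it observed, and the check for an already completed execution lives in the code that applies the update. This is only an illustration of that idea. `ValidationPlacementSketch`, `observedStatus`, `update`, and the `Status` values are hypothetical and do not claim to match the actual change in this PR.

```java
import java.util.Optional;

final class ValidationPlacementSketch {
    enum Status { RUNNING, SUCCEEDED, FAILED } // hypothetical status values

    // Derivation never hides a terminal outcome; empty only means "no news".
    static Optional<Status> observedStatus(boolean deadlineExceeded, boolean succeeded) {
        if (deadlineExceeded) {
            return Optional.of(Status.FAILED);
        }
        if (succeeded) {
            return Optional.of(Status.SUCCEEDED);
        }
        return Optional.empty();
    }

    // The "already completed" validation is applied where the stored state is known.
    static Status update(Status stored, Optional<Status> observed) {
        if (stored != Status.RUNNING) {
            return stored; // already completed: ignore late or duplicate events
        }
        return observed.orElse(stored);
    }
}
```

With this placement, a job killed at the deadline moves from RUNNING to a terminal failure state, while an execution that is already completed stays unchanged.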

Testing Done
Added an integration test.

Signed-off-by: Miroslav Ivanov <miroslavi@vmware.com>
@mivanov1988 enabled auto-merge (squash) May 25, 2023 12:04
@mivanov1988 merged commit 67e739c into main May 25, 2023
@mivanov1988 deleted the person/miroslavi/killed-job-was-shown-as-successful2 branch May 25, 2023 12:04