Skip to content

Initial TaskRun Reconcile Refactor #1866

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

adambkaplan
Copy link
Member

@adambkaplan adambkaplan commented Apr 16, 2025

Changes

This is a refactor of how we reconcile TaskRun objects into BuildRun objects. The intent is to create useful abstractions of the TaskRun's lifecycle so that we can extend these principles to other Kubernetes object types.

/kind cleanup

Submitter Checklist

  • Includes tests if functionality changed/was added
  • Includes docs if changes are user-facing
  • Set a kind label on this PR
  • Release notes block has been filled in, or marked NONE

See the contributor guide
for details on coding conventions, github and prow interactions, and the code review process.

Release Notes

NONE

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 16, 2025
@pull-request-size pull-request-size bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 16, 2025
@openshift-ci openshift-ci bot added the release-note Label for when a PR has specified a release note label Apr 16, 2025
Copy link
Contributor

openshift-ci bot commented Apr 16, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign heavywombat for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pull-request-size pull-request-size bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 17, 2025
Use standard controller-runtime fake client for unit tests.

Signed-off-by: Adam Kaplan <adam.kaplan@redhat.com>
@adambkaplan adambkaplan force-pushed the taskrun-reconcile-refactor branch from 6ce0e33 to 8e1f2c5 Compare April 19, 2025 14:25
Initial refactor to use a factory pattern when generating the Tekton
TaskRun for a Shipwright BuildRun. This starts us down the path of
using other k8s objects besides TaskRuns to execute the build process.
Other possibilities include PipelineRuns, Argo Workflows, and even
standard k8s objects (Pod, Job).

Key logic moved out of the main BuildRun reconciler and `resources`
package:

- Cancellation of underlying runner (TaskRun)
- Syncing of status to the BuildRun object
- Sync of status from TaskRun results.
- Sync of status from TaskRun conditions.
- Sync of status from failure messages recorded in termination logs.
- Report timestamps of started, completed, and other lifecycle
  events:
  - Pod creation time
  - Pod init containers finished
  - Execution start time
- Migrate unit tests to use controller-runtime fake client
- Move status condition update calls back to the main reconciler.

Further abstractions are likely needed to ensure we properly generate
the execution objects for other APIs.

Signed-off-by: Adam Kaplan <adam.kaplan@redhat.com>
@adambkaplan adambkaplan force-pushed the taskrun-reconcile-refactor branch from 8e1f2c5 to eca85d0 Compare April 21, 2025 14:19
@openshift-ci openshift-ci bot added release-note-none Label for when a PR does not need a release note and removed release-note Label for when a PR has specified a release note labels Apr 21, 2025
@adambkaplan adambkaplan changed the title WIP - Taskrun Reconcile Refactor Initial TaskRun Reconcile Refactor Apr 21, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 21, 2025
@adambkaplan adambkaplan added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Apr 21, 2025
Comment on lines +486 to +487
// taskrun pod ramp-up (time between pod creation and last init container completion)
buildmetrics.TaskRunPodRampUpDurationObserve(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something we can advocate for in the Tekton community? This exposes metrics that are specific to an underlying implementation, making it difficult to use a different API for reconciling BuildRuns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are changing the behavior here. We so far measured the duration from the time when we created the object that does the work (TaskRun) up to when we knew it would start performing the strategy steps (end of TaskRun init containers).

Now you measure pod creation as start and take out the Tekton controller and its time to create a Pod from a TaskRun completely.

If we decide that we want to use different objects to run the BuildRun and want to continue to provide such a metric, then we'd have to rename it for sure. But if we want to keep something like this, then we should not change it. You could add a CalculateRampUpDuration function on the BuildRunner.

Comment on lines +476 to +477
// taskrun ramp-up duration (time between taskrun creation and taskrun pod creation)
buildmetrics.TaskRunRampUpDurationObserve(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar issue - this metric is tied to an underlying implementation. In theory a runner's API does not need to create Kubernetes pods, it just needs to represent the execution of a workload. For example, a KubeVirt VirtualMachineInstance: https://kubevirt.io/user-guide/user_workloads/lifecycle/

generatedTaskRun *pipelineapi.TaskRun
)

generatedTaskRun, err := resources.GenerateTaskRun(t.config, build, buildRun, serviceAccount.Name, strategy)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I'm leaving the existing logic for creating TaskRuns in place. I plan on breaking this logic up and moving it to the TaskRunBuildRunner struct (or something similar). This PR is big enough as it is.

Comment on lines +473 to +476
// Set OwnerReference for BuildRun and TaskRun
if err := controllerutil.SetOwnerReference(buildRun, generatedTaskRun, t.scheme); err != nil {
return nil, err
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this here for now, but I may move the setting of owner references to the main Reconcile function in a subsequent PR.

@adambkaplan
Copy link
Member Author

/assign @SaschaSchwarze0

/cc @HeavyWombat @karanibm6

return e.Message
}

func (e *BuildRunnerValidationError) IsTerminal() bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

godoc please, currently doing the first pass and I cannot imagine what this is for - good indicator it requires docs.

@@ -0,0 +1,491 @@
package runner
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how the validate check is green given this lacks copyright.

Comment on lines +177 to +181
buildSpec := buildrun.Status.BuildSpec

if buildSpec.Source == nil {
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not accurate. status.buildSpec is a copy of the Build spec. But the real source may be the BuildRun source. For example, in case of a local source, this causes an overwrite.

buildRun.Status.Output.Digest = result.Value.StringVal

case generateOutputResultName(imageSizeResult):
// TODO: Throw error that should not be retried?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to log a warning if the conversion here fails. Usually means that the build strategy is incorrect in the case of strategy-managed push.

}

case pipelineapi.TaskRunReasonTimedOut:
reason = "BuildRunTimeout"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have no constant for this ?

Comment on lines +486 to +487
// taskrun pod ramp-up (time between pod creation and last init container completion)
buildmetrics.TaskRunPodRampUpDurationObserve(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are changing the behavior here. We so far measured the duration from the time when we created the object that does the work (TaskRun) up to when we knew it would start performing the strategy steps (end of TaskRun init containers).

Now you measure pod creation as start and take out the Tekton controller and its time to create a Pod from a TaskRun completely.

If we decide that we want to use different objects to run the BuildRun and want to continue to provide such a metric, then we'd have to rename it for sure. But if we want to keep something like this, then we should not change it. You could add a CalculateRampUpDuration function on the BuildRunner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. release-note-none Label for when a PR does not need a release note size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

2 participants