Initial TaskRun Reconcile Refactor #1866

adambkaplan · 2025-04-16T21:57:08Z

Changes

This is a refactor of how we reconcile TaskRun objects into BuildRun objects. The intent is to create useful abstractions of the TaskRun's lifecycle so that we can extend these principles to other Kubernetes object types.

/kind cleanup

Submitter Checklist

Includes tests if functionality changed/was added
Includes docs if changes are user-facing
Set a kind label on this PR
Release notes block has been filled in, or marked NONE

See the contributor guide
for details on coding conventions, github and prow interactions, and the code review process.

Release Notes

NONE

adambkaplan · 2025-04-21T14:35:33Z

pkg/reconciler/buildrun/buildrun.go

+					// taskrun pod ramp-up (time between pod creation and last init container completion)
+					buildmetrics.TaskRunPodRampUpDurationObserve(


Is this something we can advocate for in the Tekton community? This exposes metrics that are specific to an underlying implementation, making it difficult to use a different API for reconciling BuildRuns.

You are changing the behavior here. We so far measured the duration from the time when we created the object that does the work (TaskRun) up to when we knew it would start performing the strategy steps (end of TaskRun init containers).

Now you measure pod creation as start and take out the Tekton controller and its time to create a Pod from a TaskRun completely.

If we decide that we want to use different objects to run the BuildRun and want to continue to provide such a metric, then we'd have to rename it for sure. But if we want to keep something like this, then we should not change it. You could add a CalculateRampUpDuration function on the BuildRunner.

I tried not to change the behavior (too much), and perhaps with the "BuildRunner" abstraction what is being reported is less clear.

The BuildRunner's GetCreationTimestamp() method returns the creation timestamp of the object executing the build. In this case, the creation timestamp of the Tekton TaskRun. So TaskRunRampUpObserve records the duration between when we created the TaskRun and Tekton created the underlying pod. So I don't think I am modifying the current metric that is reported.

Similar story for the TaskRunPodRampUp metric - it is recording the time between when the pod is created and when the pod's last init container terminates. Side note: this might break if Tekton changes its sidecar behavior to use true Kubernetes "sidecars".

The main behavior change (which actually required me to modify a unit test) is that computing these timestamps makes two kube apiserver calls, instead of one.

adambkaplan · 2025-04-21T14:39:01Z

pkg/reconciler/buildrun/buildrun.go

+				// taskrun ramp-up duration (time between taskrun creation and taskrun pod creation)
+				buildmetrics.TaskRunRampUpDurationObserve(


Similar issue - this metric is tied to an underlying implementation. In theory a runner's API does not need to create Kubernetes pods, it just needs to represent the execution of a workload. For example, a KubeVirt VirtualMachineInstance: https://kubevirt.io/user-guide/user_workloads/lifecycle/

adambkaplan · 2025-04-21T14:42:31Z

pkg/reconciler/buildrun/runner/taskrun.go

+		generatedTaskRun *pipelineapi.TaskRun
+	)
+
+	generatedTaskRun, err := resources.GenerateTaskRun(t.config, build, buildRun, serviceAccount.Name, strategy)


For now, I'm leaving the existing logic for creating TaskRuns in place. I plan on breaking this logic up and moving it to the TaskRunBuildRunner struct (or something similar). This PR is big enough as it is.

adambkaplan · 2025-04-21T14:44:46Z

pkg/reconciler/buildrun/runner/taskrun.go

+	// Set OwnerReference for BuildRun and TaskRun
+	if err := controllerutil.SetOwnerReference(buildRun, generatedTaskRun, t.scheme); err != nil {
+		return nil, err
+	}


Leaving this here for now, but I may move the setting of owner references to the main Reconcile function in a subsequent PR.

adambkaplan · 2025-04-21T20:16:27Z

/assign @SaschaSchwarze0

/cc @HeavyWombat @karanibm6

SaschaSchwarze0 · 2025-05-10T13:17:08Z

pkg/reconciler/buildrun/runner/types.go

+	return e.Message
+}
+
+func (e *BuildRunnerValidationError) IsTerminal() bool {


godoc please, currently doing the first pass and I cannot imagine what this is for - good indicator it requires docs.

SaschaSchwarze0 · 2025-05-10T13:18:26Z

pkg/reconciler/buildrun/runner/taskrun.go

@@ -0,0 +1,491 @@
+package runner


I wonder how the validate check is green given this lacks copyright.

SaschaSchwarze0 · 2025-05-10T13:22:40Z

pkg/reconciler/buildrun/runner/taskrun.go

+	buildSpec := buildrun.Status.BuildSpec
+
+	if buildSpec.Source == nil {
+		return
+	}


This is not accurate. status.buildSpec is a copy of the Build spec. But the real source may be the BuildRun source. For example, in case of a local source, this causes an overwrite.

This might be an existing bug/issue. The lines here are a direct copy from https://github.com/shipwright-io/build/blob/main/pkg/reconciler/buildrun/resources/sources.go#L80-L84

SaschaSchwarze0 · 2025-05-10T13:24:33Z

pkg/reconciler/buildrun/runner/taskrun.go

+			buildRun.Status.Output.Digest = result.Value.StringVal
+
+		case generateOutputResultName(imageSizeResult):
+			// TODO: Throw error that should not be retried?


I suggest to log a warning if the conversion here fails. Usually means that the build strategy is incorrect in the case of strategy-managed push.

SaschaSchwarze0 · 2025-05-10T13:25:49Z

pkg/reconciler/buildrun/runner/taskrun.go

+		}
+
+	case pipelineapi.TaskRunReasonTimedOut:
+		reason = "BuildRunTimeout"


We have no constant for this ?

Apparently not - my search for BuildRunTimeout only returns refrences to the string and not assignment to a constant. Promoting this to the API package is a good idea.

SaschaSchwarze0 · 2025-05-10T13:43:29Z

pkg/reconciler/buildrun/buildrun.go

+					// taskrun pod ramp-up (time between pod creation and last init container completion)
+					buildmetrics.TaskRunPodRampUpDurationObserve(


You are changing the behavior here. We so far measured the duration from the time when we created the object that does the work (TaskRun) up to when we knew it would start performing the strategy steps (end of TaskRun init containers).

Now you measure pod creation as start and take out the Tekton controller and its time to create a Pod from a TaskRun completely.

If we decide that we want to use different objects to run the BuildRun and want to continue to provide such a metric, then we'd have to rename it for sure. But if we want to keep something like this, then we should not change it. You could add a CalculateRampUpDuration function on the BuildRunner.

openshift-ci · 2025-05-14T18:02:47Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from saschaschwarze0. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

adambkaplan · 2025-05-14T18:10:32Z

I think I fixed all the issues identified in the review (or addressed them in comments). PTAL.

Use standard controller-runtime fake client for unit tests. Signed-off-by: Adam Kaplan <adam.kaplan@redhat.com>

Initial refactor to use a factory pattern when generating the Tekton TaskRun for a Shipwright BuildRun. This starts us down the path of using other k8s objects besides TaskRuns to execute the build process. Other possibilities include PipelineRuns, Argo Workflows, and even standard k8s objects (Pod, Job). Key logic moved out of the main BuildRun reconciler and `resources` package: - Cancellation of underlying runner (TaskRun) - Syncing of status to the BuildRun object - Sync of status from TaskRun results. - Sync of status from TaskRun conditions. - Sync of status from failure messages recorded in termination logs. - Report timestamps of started, completed, and other lifecycle events: - Pod creation time - Pod init containers finished - Execution start time - Migrate unit tests to use controller-runtime fake client - Move status condition update calls back to the main reconciler. Further abstractions are likely needed to ensure we properly generate the execution objects for other APIs. Signed-off-by: Adam Kaplan <adam.kaplan@redhat.com>

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 16, 2025

pull-request-size bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 16, 2025

openshift-ci bot added the release-note Label for when a PR has specified a release note label Apr 16, 2025

openshift-ci bot requested review from HeavyWombat and SaschaSchwarze0 April 16, 2025 21:57

shipwright-ci-bot added this to Shipwright Overview Apr 16, 2025

pull-request-size bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 17, 2025

adambkaplan force-pushed the taskrun-reconcile-refactor branch 2 times, most recently from 8e1f2c5 to eca85d0 Compare April 21, 2025 14:19

openshift-ci bot added release-note-none Label for when a PR does not need a release note and removed release-note Label for when a PR has specified a release note labels Apr 21, 2025

adambkaplan changed the title ~~WIP - Taskrun Reconcile Refactor~~ Initial TaskRun Reconcile Refactor Apr 21, 2025

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 21, 2025

adambkaplan added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Apr 21, 2025

adambkaplan commented Apr 21, 2025

View reviewed changes

openshift-ci bot assigned SaschaSchwarze0 Apr 21, 2025

openshift-ci bot requested a review from karanibm6 April 21, 2025 20:16

SaschaSchwarze0 requested changes May 10, 2025

View reviewed changes

adambkaplan force-pushed the taskrun-reconcile-refactor branch from eca85d0 to f1ec968 Compare May 14, 2025 18:02

vendor: Add controller-runtime/pkg/client/fake Package

d4e7c4c

Use standard controller-runtime fake client for unit tests. Signed-off-by: Adam Kaplan <adam.kaplan@redhat.com>

adambkaplan force-pushed the taskrun-reconcile-refactor branch from f1ec968 to 0b98316 Compare May 15, 2025 18:56

adambkaplan force-pushed the taskrun-reconcile-refactor branch from 0b98316 to 5f76390 Compare May 29, 2025 20:50

adambkaplan mentioned this pull request Jul 1, 2025

WIP: Refactor TaskRuns out of BuildRun Controller #1924

Open

4 tasks

		// taskrun pod ramp-up (time between pod creation and last init container completion)
		buildmetrics.TaskRunPodRampUpDurationObserve(

		// taskrun ramp-up duration (time between taskrun creation and taskrun pod creation)
		buildmetrics.TaskRunRampUpDurationObserve(

Initial TaskRun Reconcile Refactor #1866

Are you sure you want to change the base?

Initial TaskRun Reconcile Refactor #1866

Uh oh!

Conversation

adambkaplan commented Apr 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Submitter Checklist

Release Notes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adambkaplan May 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adambkaplan commented Apr 21, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented May 14, 2025

Uh oh!

adambkaplan commented May 14, 2025

Uh oh!

Uh oh!

adambkaplan commented Apr 16, 2025 •

edited

Loading

adambkaplan May 14, 2025 •

edited

Loading