Controller Impl #5
Conversation
Signed-off-by: Yihong Wang <yh.wang@ibm.com>
// +kubebuilder:subresource:status

// EvalJob is the Schema for the evaljobs API
type EvalJob struct {
I think we'll want to be careful about how we name the CR. I think `EvalJob` is too vague since there are numerous things that users may want to evaluate in a kube cluster. I think we could call it `LMEvalJob` to explicitly link it to the `lm-eval` python package.
Sure thing. This should be related to your other comment about the `GroupName` and `Kind`. The reason I have `EvalJob` in the first place is that the GroupName is `lm-eval-service.github.com`. If we change the GroupName to `foundation-model-stack.github.com`, then `LMEvalJob` is definitely better.
Context("When creating EvalJob under Defaulting Webhook", func() { | ||
It("Should fill in the default value if a required field is empty", func() { | ||
|
||
// TODO(user): Add your logic here |
Looks like we should do these!
var (
	// GroupVersion is group version used to register these objects
	GroupVersion = schema.GroupVersion{Group: "lm-eval-service.github.com", Version: "v1beta1"}
I wonder if we should use something like `foundation-model-stack.github.com` as the group with `LMEvalJob` as the kind. Feels like we may have other kinds we want to bundle under this. I could also see not coupling this to other projects in the org. If we go that route, I'd drop the "service" since this is not managing the REST microservice.
I will change the GroupName to `foundation-model-stack.github.com` and the Kind to `LMEvalJob`. Then we can expand to other kinds for other services in the future.
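For reference, the rename would only touch the group-version registration file. A minimal sketch of what that could look like, following the standard kubebuilder layout:

```go
// groupversion_info.go (sketch): registers the renamed API types.
package v1beta1

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/scheme"
)

var (
	// GroupVersion is the group version used to register these objects.
	GroupVersion = schema.GroupVersion{Group: "foundation-model-stack.github.com", Version: "v1beta1"}

	// SchemeBuilder is used to add go types to the GroupVersionKind scheme.
	SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion}

	// AddToScheme adds the types in this group-version to the given scheme.
	AddToScheme = SchemeBuilder.AddToScheme
)
```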
type EvalJobSpec struct {
	// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
	// Important: Run "make" to regenerate code after modifying this file
All of these fields under `spec` relate to the CLI args of the job. Do you think we need any further config around the job itself? I would think at a minimum we'd need:
- image name
- resources
- passthrough labels and annotations
- config for mounting storage volumes for inputs/outputs
We'll also eventually need the ability to run sidecars and the like inside the job (e.g. for proprietary storage connectivity)
I do use a `ConfigMap` to store some settings which may relate to the things you listed. We definitely need to expand the settings to support more configuration-wide resources.
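To make the list above concrete, here is a rough sketch of job-level fields that could sit next to the CLI-related ones in the spec; every field name below is provisional, not part of this PR:

```go
import corev1 "k8s.io/api/core/v1"

// Sketch only: possible job-level additions to the spec.
type EvalJobSpec struct {
	// ... existing CLI-related fields ...

	// Image overrides the default job image configured for the controller.
	// +optional
	Image string `json:"image,omitempty"`

	// Resources for the main container running lm-eval.
	// +optional
	Resources corev1.ResourceRequirements `json:"resources,omitempty"`

	// Labels and Annotations are passed through to the job's pod.
	// +optional
	Labels map[string]string `json:"labels,omitempty"`
	// +optional
	Annotations map[string]string `json:"annotations,omitempty"`

	// Volumes and VolumeMounts expose storage for inputs/outputs.
	// +optional
	Volumes []corev1.Volume `json:"volumes,omitempty"`
	// +optional
	VolumeMounts []corev1.VolumeMount `json:"volumeMounts,omitempty"`

	// Sidecars are extra containers run alongside the job container,
	// e.g. for proprietary storage connectivity.
	// +optional
	Sidecars []corev1.Container `json:"sidecars,omitempty"`
}
```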
}

type ServiceOptions struct {
	PodImage string
Is there anywhere that this can be set in the CRD that I'm not seeing?
Nope, it's not linked to any field in the CRD yet. I use a `ConfigMap` to store these settings for now. We can have further discussion about how this should be mapped to the REST API, how it converts to CRD fields, etc.
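For illustration, the controller could hydrate these options from that ConfigMap at start-up along these lines; the key names here are assumptions, not the actual keys in this PR, and this assumes `ServiceOptions` carries `PodImage` and `DriverImage`:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// loadServiceOptions reads the controller's ConfigMap and fills ServiceOptions.
// The keys ("pod-image", "driver-image") are placeholders for illustration.
func loadServiceOptions(ctx context.Context, c client.Client, namespace, name string) (*ServiceOptions, error) {
	var cm corev1.ConfigMap
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, &cm); err != nil {
		return nil, err
	}
	return &ServiceOptions{
		PodImage:    cm.Data["pod-image"],
		DriverImage: cm.Data["driver-image"],
	}, nil
}
```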
	return ctrl.Result{}, client.IgnoreNotFound(err)
}

if !evalJob.ObjectMeta.DeletionTimestamp.IsZero() {
Interesting. Is this how a `finalizer` is triggered for a `kubebuilder` controller?
yes!
Actually, for the controller it's reconciliation-based, which means you don't know which event triggered the `reconcile` function. The `DeletionTimestamp` field is what says the current state of the CR is "being deleted". After the CR is deleted, there will be another `reconcile` call, but I use a `predicate` to ignore the deletion events since we have nothing to do after the CR has been deleted.
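For readers unfamiliar with the pattern, this is roughly how that delete-event filtering is wired up in kubebuilder. A sketch only; the project's own API package import is elided:

```go
import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	// plus this project's API package, aliased as lmevalservicev1beta1
)

// SetupWithManager registers the reconciler and filters out delete events:
// deletion intent is still observed through the non-zero DeletionTimestamp on
// the live object, and nothing is left to do once the CR is actually gone.
func (r *EvalJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&lmevalservicev1beta1.EvalJob{}).
		WithEventFilter(predicate.Funcs{
			// Skip delete events; finalizer logic already ran while the
			// object still existed with a DeletionTimestamp set.
			DeleteFunc: func(event.DeleteEvent) bool { return false },
		}).
		Complete(r)
}
```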
	return r.checkScheduledPod(ctx, log, evalJob)
case lmevalservicev1beta1.CompleteJobState:
	return r.handleComplete(ctx, log, evalJob)
case lmevalservicev1beta1.CancelledJobState:
Do we have cancellation separate from deletion? I could see that making sense so that the status and artifacts from a cancelled job are preserved. I think we may also want to consider some kind of retention policy for old jobs so that they don't pile up in the kube master.
The `Cancelled` state is used to trigger the cancellation and is different from the `Complete` state in the current design. After a job is cancelled, it changes to the `Complete` state and the reason will be `Cancelled`. Yes, we do need a retention policy, but it could be either a separate mechanism from the controller or part of the controller. We need to figure out a good way to consolidate reconcile-based and time-based logic.
	client.Client
	Scheme    *runtime.Scheme
	Recorder  record.EventRecorder
	ConfigMap string
How does the user point to this `ConfigMap`? I'm not seeing it in the CRD (could be missing it somewhere).
I plan to use the `ConfigMap` to store settings for the controller and the fields in the REST APIs. For the sake of argument, if there is a `size` property in the REST API that allows users to specify the pod's spec, we may have a `big-size` entry in the `ConfigMap` that contains the 4000m CPU and 8GB memory settings, plus medium and small size settings. So it's not directly used by the CR, but it is related.
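To sketch that idea, using only the example values from this thread (in practice the presets would be parsed out of the ConfigMap rather than hard-coded):

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// sizePresets maps a REST-API "size" value to pod resources. Only "big-size"
// uses the numbers mentioned above; medium/small would follow the same shape.
var sizePresets = map[string]corev1.ResourceRequirements{
	"big-size": {
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("4000m"),
			corev1.ResourceMemory: resource.MustParse("8Gi"),
		},
	},
}
```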
// construct a new pod and create a pod for the job
currentTime := v1.Now()
pod := r.createPod(job)
if err := r.Create(ctx, pod, &client.CreateOptions{}); err != nil {
I'm not super familiar with how `kubebuilder` controllers get wired into the operator process. Is there any chance of a race here where the check to see if it's a new CR says "yes," but by the time this `r.Create` happens it's already been reconciled by another reconciliation process or a sibling operator? I suppose there's always the race that a user might manually make the pod themselves, but we probably don't need to account for that.
Ha, you are talking about scaling of the controller now. For scaling, each controller should work on its own bucket and different CRs should go to different buckets; at least that's how knative works. I will check how multiple replicas of a `kubebuilder` controller work. For the race condition, theoretically one reconciliation would throw an error and the other would pass, so we can add logic to validate the pod status and correct the state of the CR. I noted this down and will add some logic to handle that.
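One possible shape for that validation logic, sketched here: treat an AlreadyExists error from the create as "another reconcile (or a user) got there first" and adopt the existing pod instead of failing. The helper name is hypothetical; `createPod` and `getPod` mirror the current draft:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createOrAdoptPod creates the pod for the job; if a concurrent reconcile
// already created it, fall back to reading the existing pod so the CR state
// can still be corrected.
func (r *EvalJobReconciler) createOrAdoptPod(ctx context.Context, job *lmevalservicev1beta1.EvalJob) (*corev1.Pod, error) {
	pod := r.createPod(job)
	err := r.Create(ctx, pod, &client.CreateOptions{})
	if err == nil {
		return pod, nil
	}
	if apierrors.IsAlreadyExists(err) {
		// another reconciliation won the race; read back what exists
		return r.getPod(ctx, job)
	}
	return nil, err
}
```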
if pod.Status.ContainerStatuses == nil {
	// wait for the pod to initialize and run the containers
	return ctrl.Result{Requeue: true, RequeueAfter: time.Second * 10}, nil
Worth making this requeue timing configurable?
yes, I will put it into the ConfigMap
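A small sketch of how that could look once the interval lives in the ConfigMap. The `PodCheckingInterval` option field is hypothetical; the `pod-checking-interval` key matches the one added later in this PR:

```go
import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// requeueResult builds the requeue result for pods that are still starting.
// The interval would come from the ConfigMap's "pod-checking-interval" entry,
// parsed into the controller options at start-up; 10s stays as the fallback.
func (r *EvalJobReconciler) requeueResult() ctrl.Result {
	interval := r.options.PodCheckingInterval
	if interval <= 0 {
		interval = 10 * time.Second
	}
	return ctrl.Result{Requeue: true, RequeueAfter: interval}
}
```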
for _, cstatus := range pod.Status.ContainerStatuses {
	if cstatus.Name == "main" {
		if cstatus.LastTerminationState.Terminated == nil {
			return ctrl.Result{Requeue: true, RequeueAfter: time.Second * 10}, nil
These conditionals of when a short-circuit `return` is used vs when the status is updated without returning and it falls through to the final `return` are a little tangled. I think this could all be done more simply by pulling the `"main"` container out instead of looping and then having all conditionals fall through to a single `return` statement.
Makes sense, but `containerStatuses` is not a map-based structure, so we need to loop through it to get the `main` one. Let me see if I can simplify this part.
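A small helper along those lines, as a sketch; the statuses really are a slice, so the lookup still loops, but the state branching can then collapse into a single return path:

```go
import corev1 "k8s.io/api/core/v1"

// mainContainerStatus returns the status of the "main" container, or nil if
// the kubelet has not reported it yet.
func mainContainerStatus(pod *corev1.Pod) *corev1.ContainerStatus {
	for i := range pod.Status.ContainerStatuses {
		if pod.Status.ContainerStatuses[i].Name == "main" {
			return &pod.Status.ContainerStatuses[i]
		}
	}
	return nil
}
```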
func (r *EvalJobReconciler) getPod(ctx context.Context, job *lmevalservicev1beta1.EvalJob) (*corev1.Pod, error) {
	var pod = corev1.Pod{}
	if err := r.Get(ctx, types.NamespacedName{Namespace: job.Namespace, Name: job.Name}, &pod); err != nil {
This looks like it's using the name of the `job` as the name of the `pod`? Is that always going to be the case? Seems like we might want to do some kind of hash suffix to correlate with the kind of naming used for other parent/child resource types like `Job` or `ReplicaSet`. This would be a nice shorthand for users to identify the pod as "machine created" vs "human created."
Since the pod name is stored in the status subresource, we definitely can have our own naming logic. Apologies for being lazy about this in the first draft. :-p I put the `OwnerReference` in the pod, so even from the pod we know which job it belongs to.
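For the naming part, one option (a sketch, not what the PR currently does) is a Job/ReplicaSet-style random suffix, with the result recorded in `status.podName`:

```go
import (
	"fmt"

	utilrand "k8s.io/apimachinery/pkg/util/rand"
)

// generatePodName appends a short random suffix to the job name, mirroring
// the parent/child naming of Job and ReplicaSet. The generated name would be
// stored in status.podName so later reconciles can find the pod again.
func generatePodName(jobName string) string {
	return fmt.Sprintf("%s-%s", jobName, utilrand.String(5))
}
```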
			return &pod, nil
		}
	}
	return nil, fmt.Errorf("pod doesn't have proper entry in the OwnerReferences")
Where does this error go if it triggers? Would this prevent future reconciliations from mucking with the pod? In other words, could a user manually remove the `ownerReferences` entry to cause the controller to "disown" the pod?
If users manually modify the `OwnerReference` entry, then the controller won't be able to get `Delete` events for the pod and there will be no reconcile cycle; that's how the controller works. But with the driver embedded in the pod, we still know whether the job is done or not.
APIVersion: "v1", | ||
}, | ||
ObjectMeta: v1.ObjectMeta{ | ||
Name: job.Status.PodName, |
Interesting, here it looks like we're pulling the pod name from status.podName
Yep, it's stored here so we can have some naming logic for the pod. Because of the reconcile mechanism, we can only use the data in the `LMEvalJob` data struct to get the corresponding `Pod`. The `OwnerReference` is for watching the deletion event from the pod.
fmt.Sprintf("Tthe EvalJob %s in namespace %s has completed", | ||
job.Name, | ||
job.Namespace)) | ||
// TODO: final wrap up/clean up |
This will be interesting. I suspect one thing that would be really important is persisting the pod's logs somewhere that the user would be responsible for cleaning up. Those could probably get pretty verbose, but it would be great to be able to store the pod logs with the job somehow. The internet is not very helpful, which indicates that it's probably not possible, but it would be really cool if we could somehow extend the kube logging architecture such that `kubectl logs lmevaljob/<jobname>` would do the same thing as `kubectl logs <job pod name>`. Probably not a thing though!
Interesting idea. Right now, stderr/stdout are stored in a specific location in the pod and the `driver` only retrieves the `results.json`; the logs would only contain information/messages from the driver for now. We can adjust the behavior in the driver and store the logs somewhere else. Not sure if we can customize the `kubectl get lmevaljob` command to get logs; I know you can customize the `get` output in the CRD, but that info has to come from its status.
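For the status-driven part of `kubectl get`, kubebuilder's printcolumn markers are the usual mechanism. A sketch, assuming the status exposes state and reason fields and using the renamed kind; logs would still come from `kubectl logs` against the pod recorded in `status.podName`:

```go
// Extra printer columns for `kubectl get lmevaljob`.
// +kubebuilder:printcolumn:name="State",type=string,JSONPath=`.status.state`
// +kubebuilder:printcolumn:name="Reason",type=string,JSONPath=`.status.reason`
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// LMEvalJob is the Schema for the lmevaljobs API
type LMEvalJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   LMEvalJobSpec   `json:"spec,omitempty"`
	Status LMEvalJobStatus `json:"status,omitempty"`
}
```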
}

func (r *EvalJobReconciler) handleCancel(ctx context.Context, log logr.Logger, job *lmevalservicev1beta1.EvalJob) (ctrl.Result, error) {
	// delete the pod and update the state to complete
I think the same concern here around log retention would be at play. A user may want to retain the logs from the pod even after cancellation so that they can analyze the progress/problems from the job.
got it. we can do that in the driver directly.
log.Error(err, "failed to update status for cancellation") | ||
} | ||
r.Recorder.Event(job, "Normal", "Cancelled", | ||
fmt.Sprintf("Tthe EvalJob %s in namespace %s has cancelled and changed its state to Complete", |
NIT: s,Tthe,The,
will fix this in the new PR
}

func (r *EvalJobReconciler) createPod(job *lmevalservicev1beta1.EvalJob) *corev1.Pod {
	var allowPrivilegeEscalation = false
Interesting. I think it should be safe to always have these settings set, right? We certainly want to allow these settings to be set, but we might want to make sure that it's possible for users to do otherwise. I could imagine the case where a user has a custom image with a custom `Task` that does things like use temp files and read from disk. While this would certainly be an anti-pattern for a production-grade eval task, I wouldn't be at all surprised if users still wanted to be able to do it.
I would say these kinds of images will hit the SCC on OpenShift and fail. I hope we don't need to support that use case.
var allowPrivilegeEscalation = false
var runAsNonRootUser = true
var ownerRefController = true
var runAsUser int64 = 1001030000
I think we want to just let this setting be auto-assigned, right? We certainly want non-root, but I think the best practice is to not hard-code an explicit non-root UID and instead let a random (anonymous) one be assigned.
Yes, this would be auto-assigned on OpenShift. I forgot to remove it since my local dev env is using `kind`. The proper way to support both the local dev env and OpenShift (see the sketch below) is:
- getting the UID range from the namespace's annotation
- specifying an ID in that range, or a random ID if the annotation is not found (which means it's some other k8s cluster)
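A sketch of that lookup: it reads OpenShift's `openshift.io/sa.scc.uid-range` annotation ("<base>/<size>") and returns its base UID; when the annotation is absent, this version simply returns nil so `runAsUser` stays unset and the runtime assigns one, rather than picking a random ID itself:

```go
import (
	"context"
	"strconv"
	"strings"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// runAsUserFor returns a UID from the namespace's SCC uid-range annotation on
// OpenShift, or nil on other clusters (e.g. kind) so the field stays unset.
func runAsUserFor(ctx context.Context, c client.Client, namespace string) *int64 {
	var ns corev1.Namespace
	if err := c.Get(ctx, types.NamespacedName{Name: namespace}, &ns); err != nil {
		return nil
	}
	uidRange, ok := ns.Annotations["openshift.io/sa.scc.uid-range"]
	if !ok {
		return nil
	}
	base, err := strconv.ParseInt(strings.SplitN(uidRange, "/", 2)[0], 10, 64)
	if err != nil {
		return nil
	}
	return &base
}
```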
			UID: job.UID,
		},
	},
	Labels: map[string]string{
We'll want some kind of user passthrough label/annotation ability I think (useful for monitoring and visibility of all resources owned by a parent application).
Let me create an issue to track this feature: #7
{
	Name:            "driver",
	Image:           r.options.DriverImage,
	ImagePullPolicy: corev1.PullAlways,
I think this will probably need to be configurable too. I always struggle with this kind of controller on how much to make the CRD's `spec` include a generic "put anything to configure the pod here" section vs being prescriptive about the body of the pod.
Since this is used by the controller and we do have a ConfigMap for the settings, I can add the image pull policy there.
VolumeMounts: []corev1.VolumeMount{
	{
		Name:      "shared",
		MountPath: "/opt/app-root/src/bin",
Interesting. Why is this emptyDir mounting to a `bin` path?
oooh, I think I see now. Is this getting the binaries dropped into it during the init container? Why not just require that the pod's image have the necessary binaries? Maybe I'll get there once I make it to driver.go
Ah! I think I understand the role of the `driver` now. Is this basically just a wrapper around executing the python script that will push the status updates to the parent resource?
Assuming this is correct, I think we may need to be pretty careful here using a push model for status updates rather than a pull model where the controller polls the pod. The reason for this is that some security-focused users will want to isolate the `pod` where the job body runs in a namespace that fully disallows any RBAC for pods.
That's used to copy the driver from the `driver` image in the init container into the lm-eval + unitxt image in the main container. So in the main container, we are still running the `driver`, and the driver runs the "lm-eval + unitxt" python program.
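For anyone reading along, the shape of that init-container hand-off is roughly the following; the paths and the binary name inside the driver image are illustrative, not taken from this PR:

```go
import corev1 "k8s.io/api/core/v1"

// driverHandoff sketches the copy step: the init container drops the driver
// binary from the driver image into a shared emptyDir, and the main
// lm-eval + unitxt container then runs that driver, which launches the
// python program.
func driverHandoff(driverImage, jobImage string) corev1.PodSpec {
	mount := corev1.VolumeMount{Name: "shared", MountPath: "/opt/app-root/src/bin"}
	return corev1.PodSpec{
		Volumes: []corev1.Volume{{
			Name:         "shared",
			VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
		}},
		InitContainers: []corev1.Container{{
			Name:         "driver",
			Image:        driverImage,
			Command:      []string{"cp", "/bin/driver", "/opt/app-root/src/bin/driver"}, // source path is illustrative
			VolumeMounts: []corev1.VolumeMount{mount},
		}},
		Containers: []corev1.Container{{
			Name:         "main",
			Image:        jobImage,
			Command:      []string{"/opt/app-root/src/bin/driver"}, // driver then runs the lm_eval program
			VolumeMounts: []corev1.VolumeMount{mount},
		}},
	}
}
```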
SecurityContext: &corev1.PodSecurityContext{
	RunAsNonRoot: &runAsNonRootUser,
	SeccompProfile: &corev1.SeccompProfile{
		Type: corev1.SeccompProfileTypeRuntimeDefault,
Somewhere in the back of my mind, this is different depending on the version of kubernetes/openshift you're running. We probably don't need to go back in time too far though?
I lost track too, but most of the recent k8s clusters need it.
	},
},
Command: generateCmd(job),
Args:    generateArgs(job),
I think we'll need `resources` here too eventually, especially for custom tasks that may do meaty work inline (vs calling out to an external model).
created the issue to track it: #8
}

cmds := make([]string, 0, 10)
cmds = append(cmds, "python", "-m", "lm_eval", "--output_path", "/opt/app-root/src/output")
I think this will need to be a mounted volume and/or configurable. A user will need to be able to get the output results stored somewhere.
Yes, let me create an issue for this feature too: #9
	return err
}

if err := d.updateStatus(lmevalservicev1beta1.RunningJobState); err != nil {
I think this is going to pose a problem for running jobs without RBACs. There will definitely be users that want to do this, so I think we may need to consider if we can publish status updates to some medium that would be visible to the `controller` for polling. This might just be logs?
For now, if we want to avoid RBACs, the solutions off the top of my mind are:
- a polling mechanism in the controller, which I don't like
- internal API calls for the driver to update the status and upload logs

Let me also create an issue to track this: #10
	d.job.Status.Message = err.Error()
} else {
	// read the content of result*.json
	pattern := filepath.Join(d.Option.OutputPath, "result*.json")
Is this "result*.json"
something that is static to lm-eval
, or is it something a user might configure at some point with a custom task
?
That's static for `lm-eval`, but I will keep an eye on it to see if that's still true in future releases of `lm-eval`.
if err != nil {
	d.Option.Logger.Error(err, "failed to retrieve the results")
} else {
	d.job.Status.Results = string(bytes)
Interesting. Looking at this, it seems like we're just putting the whole content of the output file into the job's status. If we expect that content to always be small, I think this is OK (and then I would think we remove the ability to set `outputPath` in the CR). If we think the user would need the ability to fetch an arbitrary set of output files, we're going to need some kind of storage pointer or mounted volume.
I don't know if the results would exceed 2.5 MB; I guess for the lm-evaluation job they wouldn't. But we definitely need the storage feature. I believe I created an issue for this when replying to one of your comments above: #9
As for `outputPath`, I actually don't have it in my draft CRD :-p
- Add `image-pull-policy` in the ConfigMap to specify the imagePullPolicy of the job's pod
- Add `pod-checking-interval` in the ConfigMap to set the pod checking interval
- Update the Group and Kind of the CRD to `foundation-model-stack.github.com` and `LMEvalJob`
- Refine the `checkScheduledPod` func of the controller based on the comment

Signed-off-by: Yihong Wang <yh.wang@ibm.com>
Use kubebuilder to generate the skeleton of the controller and implement the first version of the EvalJob CRD.