Controller Impl #5
Conversation
Signed-off-by: Yihong Wang <yh.wang@ibm.com>
// +kubebuilder:subresource:status

// EvalJob is the Schema for the evaljobs API
type EvalJob struct {
I think we'll want to be careful about how we name the CR. I think `EvalJob` is too vague since there are numerous things that users may want to evaluate in a kube cluster. I think we could call it `LMEvalJob` to explicitly link it to the `lm-eval` python package.
Sure thing. This should be related to your other comment about the `GroupName` and `Kind`. The reason I have `EvalJob` in the first place is that the GroupName is `lm-eval-service.github.com`. If we change the GroupName to `foundation-model-stack.github.com`, then `LMEvalJob` is definitely better.
Context("When creating EvalJob under Defaulting Webhook", func() { | ||
It("Should fill in the default value if a required field is empty", func() { | ||
|
||
// TODO(user): Add your logic here |
Looks like we should do these!
var (
	// GroupVersion is group version used to register these objects
	GroupVersion = schema.GroupVersion{Group: "lm-eval-service.github.com", Version: "v1beta1"}
I wonder if we should use something like `foundation-model-stack.github.com` as the group with `LMEvalJob` as the kind. Feels like we may have other kinds we want to bundle under this. I could also see not coupling this to other projects in the org. If we go that route, I'd drop the "service" since this is not managing the REST microservice.
I will change the GroupName to `foundation-model-stack.github.com` and the Kind to `LMEvalJob`. Then we can expand to other kinds for other services in the future.
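For reference, the rename would only touch the group-version registration file. A minimal sketch of what that could look like, following the standard kubebuilder layout:

```go
// groupversion_info.go (sketch): registers the renamed API types.
package v1beta1

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/scheme"
)

var (
	// GroupVersion is the group version used to register these objects.
	GroupVersion = schema.GroupVersion{Group: "foundation-model-stack.github.com", Version: "v1beta1"}

	// SchemeBuilder is used to add go types to the GroupVersionKind scheme.
	SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion}

	// AddToScheme adds the types in this group-version to the given scheme.
	AddToScheme = SchemeBuilder.AddToScheme
)
```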
type EvalJobSpec struct {
	// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
	// Important: Run "make" to regenerate code after modifying this file
All of these fields under `spec` relate to the CLI args of the job. Do you think we need any further config around the job itself? I would think at a minimum we'd need:
- image name
- resources
- passthrough labels and annotations
- config for mounting storage volumes for inputs/outputs
We'll also eventually need the ability to run sidecars and the like inside the job (e.g. for proprietary storage connectivity)
I do use a `ConfigMap` to store some settings which may relate to the things you listed. We definitely need to expand the settings to support more configuration-wide resources.
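To make the list above concrete, here is a rough sketch of job-level fields that could sit next to the CLI-related ones in the spec; every field name below is provisional, not part of this PR:

```go
import corev1 "k8s.io/api/core/v1"

// Sketch only: possible job-level additions to the spec.
type EvalJobSpec struct {
	// ... existing CLI-related fields ...

	// Image overrides the default job image configured for the controller.
	// +optional
	Image string `json:"image,omitempty"`

	// Resources for the main container running lm-eval.
	// +optional
	Resources corev1.ResourceRequirements `json:"resources,omitempty"`

	// Labels and Annotations are passed through to the job's pod.
	// +optional
	Labels map[string]string `json:"labels,omitempty"`
	// +optional
	Annotations map[string]string `json:"annotations,omitempty"`

	// Volumes and VolumeMounts expose storage for inputs/outputs.
	// +optional
	Volumes []corev1.Volume `json:"volumes,omitempty"`
	// +optional
	VolumeMounts []corev1.VolumeMount `json:"volumeMounts,omitempty"`

	// Sidecars are extra containers run alongside the job container,
	// e.g. for proprietary storage connectivity.
	// +optional
	Sidecars []corev1.Container `json:"sidecars,omitempty"`
}
```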
}

type ServiceOptions struct {
	PodImage string
Is there anywhere that this can be set in the CRD that I'm not seeing?
Nope, it's not linked to any field in the CRD yet. I use a `ConfigMap` to store these settings for now. We can have further discussion about how this should be mapped to the REST API, how it converts to CRD fields, etc.
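For illustration, the controller could hydrate these options from that ConfigMap at start-up along these lines; the key names here are assumptions, not the actual keys in this PR, and this assumes `ServiceOptions` carries `PodImage` and `DriverImage`:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// loadServiceOptions reads the controller's ConfigMap and fills ServiceOptions.
// The keys ("pod-image", "driver-image") are placeholders for illustration.
func loadServiceOptions(ctx context.Context, c client.Client, namespace, name string) (*ServiceOptions, error) {
	var cm corev1.ConfigMap
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, &cm); err != nil {
		return nil, err
	}
	return &ServiceOptions{
		PodImage:    cm.Data["pod-image"],
		DriverImage: cm.Data["driver-image"],
	}, nil
}
```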
	return ctrl.Result{}, client.IgnoreNotFound(err)
}

if !evalJob.ObjectMeta.DeletionTimestamp.IsZero() {
Interesting. Is this how a `finalizer` is triggered for a `kubebuilder` controller?
yes!
Actually, for the controller it's reconciliation-based, which means you don't know which event triggered the `reconcile` function. The `DeletionTimestamp` field is what says the current state of the CR is "being deleted". After the CR is deleted, there will be another `reconcile` call, but I use a `predicate` to ignore the deletion events since we have nothing to do after the CR has been deleted.
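For readers unfamiliar with the pattern, this is roughly how that delete-event filtering is wired up in kubebuilder. A sketch only; the project's own API package import is elided:

```go
import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	// plus this project's API package, aliased as lmevalservicev1beta1
)

// SetupWithManager registers the reconciler and filters out delete events:
// deletion intent is still observed through the non-zero DeletionTimestamp on
// the live object, and nothing is left to do once the CR is actually gone.
func (r *EvalJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&lmevalservicev1beta1.EvalJob{}).
		WithEventFilter(predicate.Funcs{
			// Skip delete events; finalizer logic already ran while the
			// object still existed with a DeletionTimestamp set.
			DeleteFunc: func(event.DeleteEvent) bool { return false },
		}).
		Complete(r)
}
```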
	return r.checkScheduledPod(ctx, log, evalJob)
case lmevalservicev1beta1.CompleteJobState:
	return r.handleComplete(ctx, log, evalJob)
case lmevalservicev1beta1.CancelledJobState:
Do we have cancellation separate from deletion? I could see that making sense so that the status and artifacts from a cancelled job are preserved. I think we may also want to consider some kind of retention policy for old jobs so that they don't pile up in the kube master.
The `Cancelled` state is used to trigger the cancellation and is different from the `Complete` state in the current design. After a job is cancelled, it changes to the `Complete` state and the reason will be `Cancelled`. Yes, we do need a retention policy, but it could be either a separate mechanism from the controller or part of the controller. We need to figure out a good way to consolidate reconcile-based and time-based logic.
	client.Client
	Scheme    *runtime.Scheme
	Recorder  record.EventRecorder
	ConfigMap string
How does the user point to this `ConfigMap`? I'm not seeing it in the CRD (could be missing it somewhere).
I plan to use the `ConfigMap` to store settings for the controller and the fields in the REST APIs. For the sake of argument, if there is a `size` property in the REST API that allows users to specify the pod's spec, we may have a `big-size` entry in the `ConfigMap` that contains the 4000m CPU and 8GB memory settings, plus medium and small size settings. So it's not directly used by the CR, but it is related.
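To sketch that idea, using only the example values from this thread (in practice the presets would be parsed out of the ConfigMap rather than hard-coded):

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// sizePresets maps a REST-API "size" value to pod resources. Only "big-size"
// uses the numbers mentioned above; medium/small would follow the same shape.
var sizePresets = map[string]corev1.ResourceRequirements{
	"big-size": {
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("4000m"),
			corev1.ResourceMemory: resource.MustParse("8Gi"),
		},
	},
}
```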
// construct a new pod and create a pod for the job
currentTime := v1.Now()
pod := r.createPod(job)
if err := r.Create(ctx, pod, &client.CreateOptions{}); err != nil {
I'm not super familiar with how `kubebuilder` controllers get wired into the operator process. Is there any chance of a race here where the check to see if it's a new CR says "yes," but by the time this `r.Create` happens it's already been reconciled by another reconciliation process or a sibling operator? I suppose there's always the race that a user might manually make the pod themselves, but we probably don't need to account for that.
Ha, you are talking about scaling of the controller now. For scaling, each controller should work on its own bucket and different CRs should go to different buckets; at least that's how knative works. I will check how multiple replicas of a `kubebuilder` controller work. For the race condition, theoretically one reconciliation would throw an error and the other would pass, so we can add logic to validate the pod status and correct the state of the CR. I noted this down and will add some logic to handle that.
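One possible shape for that validation logic, sketched here: treat an AlreadyExists error from the create as "another reconcile (or a user) got there first" and adopt the existing pod instead of failing. The helper name is hypothetical; `createPod` and `getPod` mirror the current draft:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createOrAdoptPod creates the pod for the job; if a concurrent reconcile
// already created it, fall back to reading the existing pod so the CR state
// can still be corrected.
func (r *EvalJobReconciler) createOrAdoptPod(ctx context.Context, job *lmevalservicev1beta1.EvalJob) (*corev1.Pod, error) {
	pod := r.createPod(job)
	err := r.Create(ctx, pod, &client.CreateOptions{})
	if err == nil {
		return pod, nil
	}
	if apierrors.IsAlreadyExists(err) {
		// another reconciliation won the race; read back what exists
		return r.getPod(ctx, job)
	}
	return nil, err
}
```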
if pod.Status.ContainerStatuses == nil {
	// wait for the pod to initialize and run the containers
	return ctrl.Result{Requeue: true, RequeueAfter: time.Second * 10}, nil
Worth making this requeue timing configurable?
yes, I will put it into the ConfigMap
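A small sketch of how that could look once the interval lives in the ConfigMap. The `PodCheckingInterval` option field is hypothetical; the `pod-checking-interval` key matches the one added later in this PR:

```go
import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// requeueResult builds the requeue result for pods that are still starting.
// The interval would come from the ConfigMap's "pod-checking-interval" entry,
// parsed into the controller options at start-up; 10s stays as the fallback.
func (r *EvalJobReconciler) requeueResult() ctrl.Result {
	interval := r.options.PodCheckingInterval
	if interval <= 0 {
		interval = 10 * time.Second
	}
	return ctrl.Result{Requeue: true, RequeueAfter: interval}
}
```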
for _, cstatus := range pod.Status.ContainerStatuses {
	if cstatus.Name == "main" {
		if cstatus.LastTerminationState.Terminated == nil {
			return ctrl.Result{Requeue: true, RequeueAfter: time.Second * 10}, nil
These conditionals of when a short-circuit `return` is used vs when the status is updated without returning and it falls through to the final `return` are a little tangled. I think this could all be done more simply by pulling the `"main"` container out instead of looping and then having all conditionals fall through to a single `return` statement.
Makes sense, but `containerStatuses` is not a map-based structure, so we need to loop through it to get the `main` one. Let me see if I can simplify this part.
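A small helper along those lines, as a sketch; the statuses really are a slice, so the lookup still loops, but the state branching can then collapse into a single return path:

```go
import corev1 "k8s.io/api/core/v1"

// mainContainerStatus returns the status of the "main" container, or nil if
// the kubelet has not reported it yet.
func mainContainerStatus(pod *corev1.Pod) *corev1.ContainerStatus {
	for i := range pod.Status.ContainerStatuses {
		if pod.Status.ContainerStatuses[i].Name == "main" {
			return &pod.Status.ContainerStatuses[i]
		}
	}
	return nil
}
```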
func (r *EvalJobReconciler) getPod(ctx context.Context, job *lmevalservicev1beta1.EvalJob) (*corev1.Pod, error) {
	var pod = corev1.Pod{}
	if err := r.Get(ctx, types.NamespacedName{Namespace: job.Namespace, Name: job.Name}, &pod); err != nil {
This looks like it's using the name of the `job` as the name of the `pod`? Is that always going to be the case? Seems like we might want to do some kind of hash suffix to correlate with the kind of naming used for other parent/child resource types like `Job` or `ReplicaSet`. This would be a nice shorthand for users to identify the pod as "machine created" vs "human created."
Since the pod name is stored in the status subresource, we definitely can have our own naming logic. Apologies for being lazy about this in the first draft. :-p I put the `OwnerReference` in the pod, so even from the pod we know which job it belongs to.
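For the naming part, one option (a sketch, not what the PR currently does) is a Job/ReplicaSet-style random suffix, with the result recorded in `status.podName`:

```go
import (
	"fmt"

	utilrand "k8s.io/apimachinery/pkg/util/rand"
)

// generatePodName appends a short random suffix to the job name, mirroring
// the parent/child naming of Job and ReplicaSet. The generated name would be
// stored in status.podName so later reconciles can find the pod again.
func generatePodName(jobName string) string {
	return fmt.Sprintf("%s-%s", jobName, utilrand.String(5))
}
```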
			return &pod, nil
		}
	}
	return nil, fmt.Errorf("pod doesn't have proper entry in the OwnerReferences")
Where does this error go if it triggers? Would this prevent future reconciliations from mucking with the pod? In other words, could a user manually remove the `ownerReferences` entry to cause the controller to "disown" the pod?
If users manually modify the `OwnerReference` entry, then the controller won't be able to get `Delete` events for the pod and there will be no reconcile cycle; that's how the controller works. But with the driver embedded in the pod, we still know whether the job is done or not.
APIVersion: "v1", | ||
}, | ||
ObjectMeta: v1.ObjectMeta{ | ||
Name: job.Status.PodName, |
Interesting, here it looks like we're pulling the pod name from status.podName
Yep, it's stored here so we can have some naming logic for the pod. Because of the reconcile mechanism, we can only use the data in the `LMEvalJob` data struct to get the corresponding `Pod`. The `OwnerReference` is for watching the deletion event from the pod.
fmt.Sprintf("Tthe EvalJob %s in namespace %s has completed", | ||
job.Name, | ||
job.Namespace)) | ||
// TODO: final wrap up/clean up |
This will be interesting. I suspect one thing that would be really important is persisting the pod's logs somewhere that the user would be responsible for cleaning up. Those could probably get pretty verbose, but it would be great to be able to store the pod logs with the job somehow. The internet is not very helpful, which indicates that it's probably not possible, but it would be really cool if we could somehow extend the kube logging architecture such that `kubectl logs lmevaljob/<jobname>` would do the same thing as `kubectl logs <job pod name>`. Probably not a thing though!
Interesting idea. Right now, stderr/stdout are stored in a specific location in the pod and the `driver` only retrieves the `results.json`; the logs would only contain information/messages from the driver for now. We can adjust the behavior in the driver and store the logs somewhere else. Not sure if we can customize the `kubectl get lmevaljob` command to get logs; I know you can customize the `get` output in the CRD, but that info has to come from its status.
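For the status-driven part of `kubectl get`, kubebuilder's printcolumn markers are the usual mechanism. A sketch, assuming the status exposes state and reason fields and using the renamed kind; logs would still come from `kubectl logs` against the pod recorded in `status.podName`:

```go
// Extra printer columns for `kubectl get lmevaljob`.
// +kubebuilder:printcolumn:name="State",type=string,JSONPath=`.status.state`
// +kubebuilder:printcolumn:name="Reason",type=string,JSONPath=`.status.reason`
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// LMEvalJob is the Schema for the lmevaljobs API
type LMEvalJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   LMEvalJobSpec   `json:"spec,omitempty"`
	Status LMEvalJobStatus `json:"status,omitempty"`
}
```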
}

func (r *EvalJobReconciler) handleCancel(ctx context.Context, log logr.Logger, job *lmevalservicev1beta1.EvalJob) (ctrl.Result, error) {
	// delete the pod and update the state to complete
I think the same concern here around log retention would be at play. A user may want to retain the logs from the pod even after cancellation so that they can analyze the progress/problems from the job.
got it. we can do that in the driver directly.
log.Error(err, "failed to update status for cancellation") | ||
} | ||
r.Recorder.Event(job, "Normal", "Cancelled", | ||
fmt.Sprintf("Tthe EvalJob %s in namespace %s has cancelled and changed its state to Complete", |
NIT: s,Tthe,The,
will fix this in the new PR
}

func (r *EvalJobReconciler) createPod(job *lmevalservicev1beta1.EvalJob) *corev1.Pod {
	var allowPrivilegeEscalation = false
Interesting. I think it should be safe to always have these settings set, right? We certainly want to allow these settings to be set, but we might want to make sure that it's possible for users to do otherwise. I could imagine the case where a user has a custom image with a custom `Task` that does things like use temp files and read from disk. While this would certainly be an anti-pattern for a production-grade eval task, I wouldn't be at all surprised if users still wanted to be able to do it.
I would say these kinds of images will hit the SCC on OpenShift and fail. I hope we don't need to support that use case.
var allowPrivilegeEscalation = false
var runAsNonRootUser = true
var ownerRefController = true
var runAsUser int64 = 1001030000
I think we want to just let this setting be auto-assigned, right? We certainly want non-root, but I think the best practice is to not hard-code an explicit non-root UID and instead let a random (anonymous) one be assigned.
Yes, this would be auto-assigned on OpenShift. I forgot to remove it since my local dev env is using `kind`. The proper way to support both the local dev env and OpenShift (see the sketch below) is:
- getting the UID range from the namespace's annotation
- specifying an ID in that range, or a random ID if the annotation is not found (which means it's some other k8s cluster)
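A sketch of that lookup: it reads OpenShift's `openshift.io/sa.scc.uid-range` annotation ("<base>/<size>") and returns its base UID; when the annotation is absent, this version simply returns nil so `runAsUser` stays unset and the runtime assigns one, rather than picking a random ID itself:

```go
import (
	"context"
	"strconv"
	"strings"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// runAsUserFor returns a UID from the namespace's SCC uid-range annotation on
// OpenShift, or nil on other clusters (e.g. kind) so the field stays unset.
func runAsUserFor(ctx context.Context, c client.Client, namespace string) *int64 {
	var ns corev1.Namespace
	if err := c.Get(ctx, types.NamespacedName{Name: namespace}, &ns); err != nil {
		return nil
	}
	uidRange, ok := ns.Annotations["openshift.io/sa.scc.uid-range"]
	if !ok {
		return nil
	}
	base, err := strconv.ParseInt(strings.SplitN(uidRange, "/", 2)[0], 10, 64)
	if err != nil {
		return nil
	}
	return &base
}
```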
			UID: job.UID,
		},
	},
	Labels: map[string]string{
We'll want some kind of user passthrough label/annotation ability I think (useful for monitoring and visibility of all resources owned by a parent application).
Let me create an issue to track this feature: #7
{
	Name:            "driver",
	Image:           r.options.DriverImage,
	ImagePullPolicy: corev1.PullAlways,
I think this will probably need to be configurable too. I always struggle with this kind of controller on how much to make the CRD's `spec` include a generic "put anything to configure the pod here" section vs being prescriptive about the body of the pod.
Since this is used by the controller and we do have a ConfigMap for the settings, I can add the image pull policy there.
VolumeMounts: []corev1.VolumeMount{
	{
		Name:      "shared",
		MountPath: "/opt/app-root/src/bin",
Interesting. Why is this emptyDir mounting to a `bin` path?
oooh, I think I see now. Is this getting the binaries dropped into it during the init container? Why not just require that the pod's image have the necessary binaries? Maybe I'll get there once I make it to driver.go
Ah! I think I understand the role of the `driver` now. Is this basically just a wrapper around executing the python script that will push the status updates to the parent resource?
Assuming this is correct, I think we may need to be pretty careful here using a push model for status updates rather than a pull model where the controller polls the pod. The reason for this is that some security-focused users will want to isolate the `pod` where the job body runs in a namespace that fully disallows any RBAC for pods.
That's used to copy the driver from the `driver` image in the init container into the lm-eval + unitxt image in the main container. So in the main container, we are still running the `driver`, and the driver runs the "lm-eval + unitxt" python program.
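For anyone reading along, the shape of that init-container hand-off is roughly the following; the paths and the binary name inside the driver image are illustrative, not taken from this PR:

```go
import corev1 "k8s.io/api/core/v1"

// driverHandoff sketches the copy step: the init container drops the driver
// binary from the driver image into a shared emptyDir, and the main
// lm-eval + unitxt container then runs that driver, which launches the
// python program.
func driverHandoff(driverImage, jobImage string) corev1.PodSpec {
	mount := corev1.VolumeMount{Name: "shared", MountPath: "/opt/app-root/src/bin"}
	return corev1.PodSpec{
		Volumes: []corev1.Volume{{
			Name:         "shared",
			VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
		}},
		InitContainers: []corev1.Container{{
			Name:         "driver",
			Image:        driverImage,
			Command:      []string{"cp", "/bin/driver", "/opt/app-root/src/bin/driver"}, // source path is illustrative
			VolumeMounts: []corev1.VolumeMount{mount},
		}},
		Containers: []corev1.Container{{
			Name:         "main",
			Image:        jobImage,
			Command:      []string{"/opt/app-root/src/bin/driver"}, // driver then runs the lm_eval program
			VolumeMounts: []corev1.VolumeMount{mount},
		}},
	}
}
```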
SecurityContext: &corev1.PodSecurityContext{
	RunAsNonRoot: &runAsNonRootUser,
	SeccompProfile: &corev1.SeccompProfile{
		Type: corev1.SeccompProfileTypeRuntimeDefault,
Somewhere in the back of my mind, this is different depending on the version of kubernetes/openshift you're running. We probably don't need to go back in time too far though?
I lost track too, but most of the recent k8s clusters need it.
	},
},
Command: generateCmd(job),
Args:    generateArgs(job),
I think we'll need `resources` here too eventually, especially for custom tasks that may do meaty work inline (vs calling out to an external model).
created the issue to track it: #8
}

cmds := make([]string, 0, 10)
cmds = append(cmds, "python", "-m", "lm_eval", "--output_path", "/opt/app-root/src/output")
I think this will need to be a mounted volume and/or configurable. A user will need to be able to get the output results stored somewhere.
Yes, let me create an issue for this feature too: #9
	return err
}

if err := d.updateStatus(lmevalservicev1beta1.RunningJobState); err != nil {
I think this is going to pose a problem for running jobs without RBACs. There will definitely be users that want to do this, so I think we may need to consider if we can publish status updates to some medium that would be visible to the `controller` for polling. This might just be logs?
For now, if we want to avoid RBACs, the solutions off the top of my mind are:
- a polling mechanism in the controller, which I don't like
- internal API calls for the driver to update the status and upload logs

Let me also create an issue to track this: #10
	d.job.Status.Message = err.Error()
} else {
	// read the content of result*.json
	pattern := filepath.Join(d.Option.OutputPath, "result*.json")
Is this "result*.json"
something that is static to lm-eval
, or is it something a user might configure at some point with a custom task
?
That's static for `lm-eval`, but I will keep an eye on it to see if that's still true in future releases of `lm-eval`.
if err != nil {
	d.Option.Logger.Error(err, "failed to retrieve the results")
} else {
	d.job.Status.Results = string(bytes)
Interesting. Looking at this, it seems like we're just putting the whole content of the output file into the job's status. If we expect that content to always be small, I think this is OK (and then I would think we remove the ability to set `outputPath` in the CR). If we think the user would need the ability to fetch an arbitrary set of output files, we're going to need some kind of storage pointer or mounted volume.
I don't know if the results would exceed 2.5 MB; I guess for the lm-evaluation job they wouldn't. But we definitely need the storage feature. I believe I created an issue for this when replying to one of your comments above: #9
As for `outputPath`, I actually don't have it in my draft CRD :-p
- Add `image-pull-policy` in the ConfigMap to specify the imagePullPolicy of the job's pod
- Add `pod-checking-interval` in the ConfigMap to set the pod checking interval
- Update the Group and Kind of the CRD to `foundation-model-stack.github.com` and `LMEvalJob`
- Refine the `checkScheduledPod` func of the controller based on the comment

Signed-off-by: Yihong Wang <yh.wang@ibm.com>
Use kubebuilder to generate the skeleton of the controller and implement the first version of the EvalJob CRD.