ci: add automated and on demand testing of fluence #49

Merged (1 commit into modular-fluence-build, Jan 5, 2024)

Conversation

@vsoch (Member) commented Dec 17, 2023

Problem: we cannot tell if or when fluence builds will break against upstream.
Solution: add a weekly run that builds and tests images, and deploys on successful results.

For testing, I have added a complete example that uses a Job for both fluence and the default-scheduler, the reason being that we can run a container that generates output, have it complete, and there is no crash loop backoff or similar. I have added a complete testing setup using kind, all in one GitHub job so we can build both containers, load them into kind, and then run the tests. Note that MiniKube does NOT appear to work for custom schedulers - I suspect there are extensions/plugins that need to be added. Finally, I was able to figure out how to programmatically check both the pod metadata for the scheduler and the events, and that combined with the job output should be sufficient (for now) to test that fluence is working.
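
Concretely, the single GitHub job boils down to roughly the following flow (a sketch only - the image names, Dockerfile paths, and script name here are placeholders rather than the exact ones in the workflow):

# Build both containers up front so the test run never pulls from a registry
docker build -t fluence:test -f Dockerfile .
docker build -t fluence-sidecar:test -f Dockerfile.sidecar .

# Create the kind cluster and load the freshly built images into it
kind create cluster
kind load docker-image fluence:test
kind load docker-image fluence-sidecar:test

# (fluence itself gets installed from the updated manifests, with the pull policy
# set to Never so kind uses the loaded images rather than trying to pull)

# Submit the example Jobs and run the checks against pod metadata and events
kubectl apply -f fluence-job.yaml
kubectl apply -f default-job.yaml
bash ./test.sh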

Summary

In summary this PR:

  • Adds a testing workflow, with triggers for on demand, weekly testing, and pull requests (deploy on all but pull requests)
  • The testing workflow still builds containers separately, and saves them as .tar.gz archives to artifact caches to load in the final step
  • A new example directory to use fluence vs. the default-scheduler with jobs (that can complete)
  • A testing script that uses the jobs and checks as much metadata as I could find (pod and events)
  • Updates the manifests to allow specifying the pull policy (for CI we do not want to pull at all, and definitely not Always)
  • Updates the README to reflect the above, and removes "Under Construction" because (after these changes) we will be in good enough shape to drop that label.

Interesting Things I Learned

  • MiniKube does not seem to work with a custom scheduler out of the box (I got errors locally). This likely can be resolved with some digging, e.g., it looks like you can set it as a config off the bat. I wasn't interested in pursuing further because kind works better anyway.
  • The versions of kind and kubectl matter! An older version of kind gave me an error running fluence, and an older version of kubectl didn't have kubectl events (see below)
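
If you want to reproduce this locally, it is worth sanity checking the tooling first (not part of the PR, just a habit):

kind version
kubectl version --client
# Older kubectl releases do not have the events subcommand used below
kubectl events --help > /dev/null 2>&1 || echo "kubectl is too old for 'kubectl events'"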

I found these commands useful for checking scheduler assignment. The first is the schedulerName (on the pod generated from the job):

default_scheduled_by=$(kubectl get pod ${pod} -o json | jq -r .spec.schedulerName)
# either "fluence" or "default-scheduler"
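
In the test script that becomes a simple assertion, something like this (a sketch - the pod name is just an example, and the script looks it up from the job):

pod="fluence-job-wmjqj"    # example pod name
scheduled_by=$(kubectl get pod ${pod} -o json | jq -r .spec.schedulerName)
if [[ "${scheduled_by}" != "fluence" ]]; then
    echo "Expected fluence but the pod requested ${scheduled_by}"
    exit 1
fi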

That worked for both. But it might be the case that the schedulerName we provide is not actually the one assigned (or maybe the pod doesn't run if it can't be satisfied, I'm not sure). Either way, it makes sense to also check via the event. Getting the event was trickier - in both cases I was interested in the event with "Reason" -> "Scheduled." For fluence, I found the name under .reportingComponent, while for the default-scheduler that field was blank and the name was under .source.component instead. For those interested, here are the two events to compare.

Default Scheduler "Scheduled" Event
{
  "kind": "Event",
  "apiVersion": "v1",
  "metadata": {
    "name": "default-job-vrkwd.17a1c42e4317ec63",
    "namespace": "default",
    "uid": "bc1c4fa7-f8c8-41d1-8dce-7e055d4f2eac",
    "resourceVersion": "845",
    "creationTimestamp": "2023-12-18T00:03:57Z"
  },
  "involvedObject": {
    "kind": "Pod",
    "namespace": "default",
    "name": "default-job-vrkwd",
    "uid": "09b44c86-4cce-4d5a-986a-ef062a8715a8",
    "apiVersion": "v1",
    "resourceVersion": "841"
  },
  "reason": "Scheduled",
  "message": "Successfully assigned default/default-job-vrkwd to kind-control-plane",
  "source": {
    "component": "default-scheduler"
  },
  "firstTimestamp": "2023-12-18T00:03:57Z",
  "lastTimestamp": "2023-12-18T00:03:57Z",
  "count": 1,
  "type": "Normal",
  "eventTime": null,
  "reportingComponent": "",
  "reportingInstance": ""
}

And for fluence we actually see that source is empty (the opposite)

Fluence "Scheduled" Event
{
  "kind": "Event",
  "apiVersion": "v1",
  "metadata": {
    "name": "fluence-job-wmjqj.17a1c42e39ffd2c6",
    "namespace": "default",
    "uid": "51ab5daf-28d2-4b0f-9708-fb89870c89e6",
    "resourceVersion": "838",
    "creationTimestamp": "2023-12-18T00:03:56Z"
  },
  "involvedObject": {
    "kind": "Pod",
    "namespace": "default",
    "name": "fluence-job-wmjqj",
    "uid": "32c1771d-bc75-4368-bb8c-b9761ba34aef",
    "apiVersion": "v1",
    "resourceVersion": "834"
  },
  "reason": "Scheduled",
  "message": "Successfully assigned default/fluence-job-wmjqj to kind-control-plane",
  "source": {},
  "firstTimestamp": null,
  "lastTimestamp": null,
  "type": "Normal",
  "eventTime": "2023-12-18T00:03:56.943351Z",
  "action": "Binding",
  "reportingComponent": "fluence",
  "reportingInstance": "fluence-fluence-7d6c87f5cf-6cplb"
}

I thought that was interesting - it must be by design that the default-scheduler is not considered an extra (reporting) component (and fluence is), and that fluence is not considered a core Kubernetes source. I have no idea, I'll probably Google around / ask people about that subtle difference. So here is the jq fu (jq is the best tool!) to get the exact output for each:

kubectl events --for pod/${fluence_job_pod} -o json  | jq -c '[ .items[] | select( .reason | contains("Scheduled")) ]' | jq -r .[0].reportingComponent
kubectl events --for pod/${default_job_pod} -o json  | jq -c '[ .items[] | select( .reason | contains("Scheduled")) ]' | jq -r .[0].source.component
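
Wrapped up as the second assertion in the testing script, that looks roughly like this (again a sketch, and the label selector assumes the job is named fluence-job):

# The component that emitted the "Scheduled" event tells us who actually bound the pod:
# fluence reports itself via .reportingComponent, kube-scheduler via .source.component
fluence_job_pod=$(kubectl get pods --selector=job-name=fluence-job -o jsonpath='{.items[0].metadata.name}')
reported_by=$(kubectl events --for pod/${fluence_job_pod} -o json | jq -r '[ .items[] | select(.reason | contains("Scheduled")) ][0].reportingComponent')

if [[ "${reported_by}" != "fluence" ]]; then
    echo "The Scheduled event for ${fluence_job_pod} was reported by '${reported_by}' and not fluence"
    exit 1
fi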

This might take a few iterations to get working in CI (I haven't used this kind setup action before) and I can ping folks when it is done.

Ok, everything is set. Ping @cmisale and @milroy for review, and of course no rush, it's ready when we need it!

@vsoch force-pushed the add/automated-testing branch 9 times, most recently from ab0c6ec to a85f0ec on December 18, 2023 00:07
@milroy (Member) left a comment

Excellent! Thank you for all your help with this infrastructure.

Comment on lines +46 to +47
kubectl apply -f fluence-job.yaml
kubectl apply -f default-job.yaml

@milroy (Member) commented:
Note that scheduling pods with kube-scheduler and Fluence on the same cluster isn't supported. There isn't currently any way to propagate pod-to-node mappings generated by kube-scheduler to Fluence.

It's important that kubectl apply -f fluence-job.yaml is executed before kubectl apply -f default-job.yaml, and that they don't specify limits or requests so they could be scheduled on the same node. That's currently the case in this PR, but I'm emphasizing it for posterity.

Regardless, there still may be some funky race condition that occurs and results in unschedulable pods.


@vsoch (Member Author) replied:
It's important that kubectl apply -f fluence-job.yaml is executed before kubectl apply -f default-job.yaml, and that they don't specify limits or requests so they could be scheduled on the same node. That's currently the case in this PR, but I'm emphasizing it for posterity.

Gotcha - I think likely for the testing cluster (and the example we already had in main) we are just doing that, putting them on the same node, and since it's a tiny kind or otherwise local cluster, there hasn't been an issue. If we extended this to an actual setup, there would be. This is an important point and I've opened an issue to emphasize it in future docs: #53. Maybe we can also think of a creative way to allow for both, possibly with kueue resource flavors that create distinct (separate) resources that are labeled for each.
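
For the record, one way to make that ordering explicit in a test script looks like this (a sketch; the job-name labels are the ones the Job controller adds automatically, and the timeout is arbitrary):

# Submit the fluence job first and wait for its pod to actually be scheduled
# before submitting the job that uses the default scheduler
kubectl apply -f fluence-job.yaml
kubectl wait --for=condition=PodScheduled pod --selector=job-name=fluence-job --timeout=60s
kubectl apply -f default-job.yaml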

@milroy (Member) commented Jan 5, 2024

Adds a testing workflow, with triggers for on demand, weekly testing, and pull requests (deploy on all but pull requests)
[...]
Updates the README to reflect the above, and removes "Under Construction" because (after these changes) we will be in good enough shape to drop that label.

This will be extremely helpful to reduce drift from upstream.

@vsoch (Member Author) commented Jan 5, 2024

This will be extremely helpful to reduce drift from upstream.

Huge agree! It will be much easier to fix tiny issues that pop up along the way, and I volunteer to take charge of monitoring that (and opening PRs with any fixes that are needed). This setup also makes sure the containers we deploy (fluence and the sidecar), built at the same frequency, carry the latest working build combined with kubernetes-sigs/scheduler-plugins.

@vsoch merged commit a7795be into modular-fluence-build on Jan 5, 2024
6 checks passed