feat: support for Flux Framework as HPC manager #3064
base: master
Conversation
Pull request overview
This pull request adds support for Flux Framework as an HPC workload manager plugin for the Kubeflow Trainer. Flux Framework provides sophisticated resource management, supports multiple MPI variants, and enables distributed HPC workloads in Kubernetes environments.
Key Changes
- Implements a new Flux plugin that integrates with the Kubeflow Trainer runtime framework
- Adds automatic Flux installation via an init container and configuration management through ConfigMaps and Secrets (a sketch of the init container follows this list)
- Provides support for both batch execution and interactive HPC cluster modes
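For a rough picture of how that installation step can be wired up, here is a minimal sketch of a flux-installer init container built with client-go apply configurations. Only the flux-installer name comes from this PR; the image, command, and mount path are illustrative assumptions:

```go
package flux

import (
	corev1ac "k8s.io/client-go/applyconfigurations/core/v1"
)

// fluxInstallerInitContainer sketches an init container that copies a Flux
// installation into a shared volume before the trainer containers start.
// The image, command, and mount path are placeholders, not the plugin's values.
func fluxInstallerInitContainer() *corev1ac.ContainerApplyConfiguration {
	return corev1ac.Container().
		WithName("flux-installer").
		WithImage("ghcr.io/example/flux-view:latest"). // hypothetical image
		WithCommand("/bin/sh", "-c", "cp -R /opt/flux /mnt/flux/view").
		WithVolumeMounts(corev1ac.VolumeMount().
			WithName("flux-view").
			WithMountPath("/mnt/flux"))
}
```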
Reviewed changes
Copilot reviewed 33 out of 33 changed files in this pull request and generated 14 comments.
Summary per file:
| File | Description |
|---|---|
| pkg/runtime/framework/plugins/flux/*.go | Core plugin implementation including broker configuration, curve certificate generation, hostlist management, and command extraction |
| pkg/runtime/framework/plugins/flux/*_test.go | Comprehensive test coverage for plugin functionality |
| pkg/runtime/framework/plugins/registry.go | Registers the Flux plugin in the framework |
| pkg/runtime/runtime.go | Extends RuntimePolicy to include FluxPolicySource |
| pkg/apis/trainer/v1alpha1/trainingruntime_types.go | Adds FluxMLPolicySource type definition with numProcPerNode parameter |
| pkg/apis/trainer/v1alpha1/zz_generated.* | Generated code for deepcopy, openapi specs, and API types |
| pkg/client/applyconfiguration/**/*.go | Generated apply configurations for Flux types |
| manifests/base/crds/*.yaml | Updated CRDs to include Flux policy configuration |
| charts/kubeflow-trainer/crds/*.yaml | Updated Helm chart CRDs |
| examples/flux/*.yaml | Example runtime and TrainJob configurations demonstrating LAMMPS workload |
| examples/flux/README.md | Comprehensive documentation for using the Flux plugin |
| api/python_api/**/*.py | Python API updates to support Flux policy types |
| api/openapi-spec/swagger.json | OpenAPI specification updates |
| build.sh | Development helper script (should be removed per PR description) |
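Based on the table above, the new policy type is roughly shaped like the sketch below. Only the numProcPerNode field is named in this PR; the exact generated type (field type, validation markers, additional fields) may differ:

```go
// FluxMLPolicySource is a sketch of the Flux policy type; the *int32 field
// type is an assumption, and the generated code may include more fields.
type FluxMLPolicySource struct {
	// NumProcPerNode is the number of processes (tasks) launched per node.
	NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`
}
```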
Resolved review threads:
- api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_flux_ml_policy_source.py
- api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_hpcml_policy_source.py
Pull Request Test Coverage Report for Build 21196269867
💛 - Coveralls
Force-pushed from 3b23be6 to 022656a
andreyvelich left a comment
Thank you for this effort @vsoch!
I left my initial comments.
/assign @akshaychitneni @astefanutti @Electronic-Waste @tenzen-y
Appreciate your review too!
```go
// Generate hostlists. The hostname (prefix) is the trainJob Name
// We need the initial jobset size, and container command
size := getJobSetSize(trainJob)
hosts := generateHostlist(trainJob.Name, size)
```
Check how we get the host list here:
```go
hostFile.WriteString(fmt.Sprintf("%s slots=%d\n", e, slots))
```
You can extract the addresses of the Pods by using Endpoint from PodSet.
Can you show me what one of those addresses looks like? Generally we don't want the entire address, but a pattern (and range) that describes it. The code there appears to be generating a massive list of hosts, which won't scale nicely for a config file.
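For concreteness, by a pattern I mean a compressed hostlist range rather than one line per host, roughly along these lines (a sketch only; the -node-0- naming and the signature are assumptions, not the plugin's exact scheme):

```go
package flux

import "fmt"

// generateHostlist sketches producing a compressed hostlist pattern
// (a prefix plus an index range) instead of enumerating every pod hostname.
func generateHostlist(prefix string, size int32) string {
	if size <= 1 {
		return fmt.Sprintf("%s-node-0-0", prefix)
	}
	return fmt.Sprintf("%s-node-0-[0-%d]", prefix, size-1)
}
```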
It looks like this: trainjob-node-1-0.trainjob.
We use this in our hostfile configuration for the default MPI plugin.
Okay - sounds like the fully qualified name isn't enabled (the bit with cluster.local). We will need that for Flux.
Do you need cluster.local, since it tries to reach the pods inside the cluster?
For the MPI hostfile it works well. Do you want to try using the same hostnames in the Flux config?
For the broker setup, I was never able to get it to work without the full name. I don't remember the details (this was many years ago), but I am sure about that. We also can't assume that we will always be connecting to only local nodes, or even Kubernetes. For context, this is built into the bootstrap config. Here is an example:
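A minimal sketch of the kind of [bootstrap] section meant here, written as it might be rendered from a Go template; the paths, port, and host pattern are illustrative, and the fully qualified default_connect address is the part that needs cluster.local:

```go
// brokerBootstrapTemplate sketches a Flux bootstrap section. All concrete
// values are placeholders; the template fields (Subdomain, Namespace,
// Hostlist) are hypothetical names for what the plugin would fill in.
const brokerBootstrapTemplate = `[bootstrap]
curve_cert = "/mnt/flux/curve.cert"
default_port = 8050
default_bind = "tcp://eth0:%p"
default_connect = "tcp://%h.{{ .Subdomain }}.{{ .Namespace }}.svc.cluster.local:%p"
hosts = [
    { host = "{{ .Hostlist }}" },
]
`
```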
Looking at this again - is there any reason we cannot use the method that we currently have to generate the hosts? It's fairly simple and seems to work OK.
Thank you @andreyvelich - I will get started on these changes right away. I wanted to get a GPU example in and was testing with AWS Trainium - the CPU example worked beautifully, but I couldn't get the Trainium devices to work (in any context, even with their tutorials, etc.). I think we will get there (we have a great collaborator there!), but in the meantime I'm going to try a small example on Google Cloud, likely with any small GPU I can get. I will keep you in the loop on my progress, update the PR, and we will be attending the next Kubeflow meeting to discuss any details that come up.
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of the affected files.
Force-pushed from a35f80f to 278cd0e
Flux supports the majority of MPI flavors/variants, and can be used to bootstrap MPI as a plugin. It adds other features for scheduling and topology that can be used for simulations and ai/ml jobs. This changeset adds the plugin implementation, including the plugin module, tests, and an example with a small README to serve as documentation for the time being. Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
```go
// Ensure we don't have an initContainer named flux-installer
js, ok := runtime.TemplateSpecApply[v1alpha2.JobSetSpecApplyConfiguration](runtimeInfo)
if !ok || js == nil {
	return nil, allErrs
}
```
Missing specific error message for this case?
What would the error look like if it isn't in reference to a specific field on a spec? This is a general "There is no runtime info and there should be" error.
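If a general (non-field-specific) error is still useful, one option is an internal error on the spec path. A sketch, assuming allErrs is a field.ErrorList from k8s.io/apimachinery/pkg/util/validation/field; the path and message here are illustrative:

```go
js, ok := runtime.TemplateSpecApply[v1alpha2.JobSetSpecApplyConfiguration](runtimeInfo)
if !ok || js == nil {
	// No JobSet template spec in the runtime info: surface a general
	// internal error instead of returning silently.
	allErrs = append(allErrs, field.InternalError(field.NewPath("spec"),
		errors.New("runtime info does not contain a JobSet template spec")))
	return nil, allErrs
}
```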
OK - one of the changes above actually seems to have broken the entire setup, because we don't have a view and are running LAMMPS independently on each pod. I need to revert everything and start over.
Still no go. The lesson here is not to rebase until the end - one of the changes (maybe moving code between files, etc.) completely broke the setup, and I don't have the original version that I spent a long time on. There is no longer an init container, even when I move back to what is here. I don't know if this will be done by mid-February @andreyvelich.

Update: I was able to restore back to (mostly) what (I think) I had, and the ConfigMap and Secret are generating again. I think there is still one bug to work through before I start this new work, but I'm relieved that it's partially back. This week is going to be busy, so I'll set expectations for next weekend or after for another update.
Sounds good, thank you for your work! Let us know if you have any additional questions about the runtime extension framework.
This pull request adds Flux Framework as a plugin to the Kubeflow Trainer. 🌀
Overview
Flux supports the majority of MPI flavors/variants, and can be used to bootstrap MPI as a plugin. It adds other features for scheduling and topology that can be used for simulations and AI/ML jobs. This changeset adds the plugin implementation, including the plugin module, tests, and an example with a small README to serve as documentation.
What this PR does / why we need it:
See https://github.com/kubeflow/trainer/tree/master/docs/proposals/2841-flux-hpc. To summarize, Flux Framework supports more MPI variants out of the box than the current MPI plugin. It brings more scheduling features, topology awareness, higher throughput, and dynamism / elasticity of the scheduler and jobs. See https://github.com/kubeflow/trainer/tree/master/docs/proposals/2841-flux-hpc#motivation. For full provenance / history, here is the initial discussion in the Kubeflow Trainer meeting.
Which issue(s) this PR fixes
Fixes #2841 (and note here, we should follow up with discussion on next steps for scoped issues)
Checklist:
See kubeflow/website#4283
@andreyvelich some notes for you.
ApplyConfiguration. If I made a mistake in design or process, please tell me directly and give me a pointer to the correct way to go about it. Here is the first completion of LAMMPS. When you remove the command, it turns into an interactive minicluster (fairly simple / straightforward, I think).
Thanks in advance for the review! I won't be able to finish the PR work tonight (figuring out the linting still) but I'll pick up tomorrow after some sleep. Really excited about this.
cc @milroy